[
https://issues.apache.org/jira/browse/FLINK-27721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567803#comment-17567803
]
Xintong Song commented on FLINK-27721:
--------------------------------------
Status updates:
Right now, I have something imperfect but workable. I probably won't have time
to improve it further in the near future. Given that we are fast approaching
the 10k-message limit, I'll try to deploy the current version.
The known limitations are:
# *Messages are not organized into threads in the frontend, making them hard
to read.* This is the same limitation that
[airflow|http://apache-airflow.slack-archives.org/] also has. Properties needed
for grouping messages into threads are already captured in the database. All we
need is to improve the way the messages are displayed.
# *It's not real-time.* Slack's new Events API never worked for me, so I went
with an approach that periodically fetches the messages at a configurable
interval (default 1h). Consequently, new messages may take up to 1 hour to
appear in the archive, which is probably fine because they can be searched in
Slack in the meantime anyway.
# *It's unlikely, but still possible, to lose messages.* With Slack's
conversations API, we first need to retrieve the parent messages sent directly
to the channel, and then, for each of them, retrieve the threaded messages
replying to it. That means for an already-retrieved thread, we cannot know
whether there are new replies without trying to retrieve it again. Moreover,
the API has a ~50 requests/min rate limit, so we probably should not frequently
re-retrieve replies for all threads. My current approach is to only retrieve
new messages for threads started within the last 30 days (configurable). That
means new replies to a thread started more than 30 days ago can be lost, which
I'd expect to be very rare.
# *Backup is not automatic.* We can dump the database with one command,
without interrupting the service. We just need to set up a cronjob to trigger
and handle the dumps (uploading & cleaning).
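For reference, the retrieval scheme behind limitations 2 and 3 can be sketched roughly as follows. This is only an illustrative sketch, not the actual archiver code: the {{conversations.history}} and {{conversations.replies}} endpoints are Slack's public Web API, but the function names, pagination details, and cutoff constant are my own placeholders.

```python
import json
import time
import urllib.parse
import urllib.request

SLACK_API = "https://slack.com/api"
# Threads started more than this long ago are not re-fetched for new replies
# (the configurable 30-day window; replies to older threads can be missed).
REFRESH_WINDOW_SECS = 30 * 24 * 3600


def slack_get(method: str, token: str, **params) -> dict:
    """Call one Slack Web API method and return the decoded JSON payload."""
    url = f"{SLACK_API}/{method}?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def should_refresh(thread_ts: str, now: float,
                   window: float = REFRESH_WINDOW_SECS) -> bool:
    """Decide whether a thread is recent enough to re-fetch its replies.

    Slack timestamps are strings like "1658217600.000100"; the integer part
    is epoch seconds.
    """
    return now - float(thread_ts) <= window


def fetch_channel(token: str, channel: str) -> list[dict]:
    """Fetch parent messages, then the replies of each recent thread."""
    messages: list[dict] = []
    cursor = None
    while True:  # page through the channel's parent messages
        page = slack_get("conversations.history", token, channel=channel,
                         **({"cursor": cursor} if cursor else {}))
        messages.extend(page.get("messages", []))
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    now = time.time()
    for msg in list(messages):
        if msg.get("reply_count") and should_refresh(msg["ts"], now):
            replies = slack_get("conversations.replies", token,
                                channel=channel, ts=msg["ts"])
            # First element repeats the parent message; drop it.
            messages.extend(replies.get("messages", [])[1:])
            time.sleep(1.5)  # stay well under the ~50/min rate limit
    return messages
```

The one-request-per-thread shape of {{conversations.replies}} is exactly why the rate limit forces the 30-day refresh window described above.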
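The backup step in limitation 4 could be driven by a cronjob calling a small script like the sketch below. Everything here is a placeholder: the actual dump command depends on the database behind the archive (not specified in this comment), the paths and retention period are made up, and the uploading part is left out.

```python
import datetime
import pathlib
import subprocess

# Placeholders: the real one-command dump depends on the archive's database.
DUMP_CMD = ["mongodump", "--archive"]  # hypothetical dump command
DUMP_DIR = pathlib.Path("/var/backups/slack-archive")
KEEP_DAYS = 14  # hypothetical retention period


def dump_database(dump_dir: pathlib.Path = DUMP_DIR) -> pathlib.Path:
    """Run the one-command dump into a timestamped file (service keeps running)."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    out = dump_dir / f"dump-{stamp}.archive"
    with out.open("wb") as f:
        subprocess.run(DUMP_CMD, stdout=f, check=True)
    return out


def prune_old_dumps(dump_dir: pathlib.Path,
                    keep_days: int = KEEP_DAYS) -> list[pathlib.Path]:
    """The "cleaning" part: delete dumps older than the retention period."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=keep_days)
    removed = []
    for path in sorted(dump_dir.glob("dump-*.archive")):
        stamp = path.stem.removeprefix("dump-")
        if datetime.datetime.strptime(stamp, "%Y%m%dT%H%M%SZ") < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A crontab entry running this hourly or daily would cover the "trigger" part; uploading the returned dump file to external storage would complete it.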
Some numbers, FYI:
# [Slack Analytics|https://apache-flink.slack.com/admin/stats] shows we now
have 9.1k messages in total. In the last 30 days, only 31% of messages were
sent in public channels, vs. 67% in DMs and 1% in private channels.
# The Slack archive captures public-channel messages only. It captures 2.5k
messages in total, taking about 7~8 minutes on my laptop. The bottleneck is
Slack's API rate limit.
# A full dump of the database, containing all 2.5k messages plus the channel &
user information, completes almost instantly. The dumped file is 3.7 MB.
I'll try to deploy the service next. Based on the numbers above, I think a
dedicated VM might not be necessary, so I'd try the flink-packages host first.
BTW, I have already backed up a dump of all public messages so far, so it
shouldn't be a problem if the service is not deployed by the time the 10k limit
is reached.
> Slack: set up archive
> ---------------------
>
> Key: FLINK-27721
> URL: https://issues.apache.org/jira/browse/FLINK-27721
> Project: Flink
> Issue Type: Sub-task
> Reporter: Xintong Song
> Assignee: Xintong Song
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)