[
https://issues.apache.org/jira/browse/FLINK-27721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567803#comment-17567803
]
Xintong Song commented on FLINK-27721:
--------------------------------------
Status updates:
Right now, I have something imperfect but workable. I probably won't have time
to improve it further in the near future. Given that we are fast approaching
the 10k-message limit, I'll try to deploy the current version.
The known limitations are:
# *Messages are not organized into threads in the frontend, making them hard
to read.* This is the same limitation that
[airflow|http://apache-airflow.slack-archives.org/] also has. Properties needed
for grouping messages into threads are already captured in the database. All we
need is to improve the way the messages are displayed.
# *It's not real-time.* Slack's new Events API never worked for me, so I went
with an approach that periodically fetches the messages at a configurable
interval (default 1h). Consequently, new messages may take up to 1 hour to
appear in the archive, which is probably fine because they can be searched in
Slack in the meantime anyway.
# *It's unlikely, but still possible, to lose messages.* With Slack's
conversations API, we first need to retrieve the parent messages sent directly
to the channel, and then, for each of them, retrieve the threaded messages
replying to it. That means for an already-retrieved thread, we cannot know
whether there are new replies without trying to retrieve it again. Moreover,
the API has a ~50 requests/min rate limit, so we probably should not frequently
re-retrieve replies for all threads. My current approach is to only retrieve
new messages for threads started within the last 30 days (configurable). That
means new replies to a thread started more than 30 days ago can be lost, which
I'd expect to be very rare.
# *Backup is not automatic.* We can dump the database with one command,
without interrupting the service. We just need to set up a cronjob to trigger
and handle the dumps (uploading & cleaning).
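For reference, the retrieval scheme behind limitations 2 and 3 can be sketched roughly as follows. This is only an illustrative sketch, not the actual archiver code: the {{conversations.history}} and {{conversations.replies}} endpoints are Slack's public Web API, but the function names, pagination details, and cutoff constant are my own placeholders.

```python
import json
import time
import urllib.parse
import urllib.request

SLACK_API = "https://slack.com/api"
# Threads started more than this long ago are not re-fetched for new replies
# (the configurable 30-day window; replies to older threads can be missed).
REFRESH_WINDOW_SECS = 30 * 24 * 3600


def slack_get(method: str, token: str, **params) -> dict:
    """Call one Slack Web API method and return the decoded JSON payload."""
    url = f"{SLACK_API}/{method}?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def should_refresh(thread_ts: str, now: float,
                   window: float = REFRESH_WINDOW_SECS) -> bool:
    """Decide whether a thread is recent enough to re-fetch its replies.

    Slack timestamps are strings like "1658217600.000100"; the integer part
    is epoch seconds.
    """
    return now - float(thread_ts) <= window


def fetch_channel(token: str, channel: str) -> list[dict]:
    """Fetch parent messages, then the replies of each recent thread."""
    messages: list[dict] = []
    cursor = None
    while True:  # page through the channel's parent messages
        page = slack_get("conversations.history", token, channel=channel,
                         **({"cursor": cursor} if cursor else {}))
        messages.extend(page.get("messages", []))
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    now = time.time()
    for msg in list(messages):
        if msg.get("reply_count") and should_refresh(msg["ts"], now):
            replies = slack_get("conversations.replies", token,
                                channel=channel, ts=msg["ts"])
            # First element repeats the parent message; drop it.
            messages.extend(replies.get("messages", [])[1:])
            time.sleep(1.5)  # stay well under the ~50/min rate limit
    return messages
```

The one-request-per-thread shape of {{conversations.replies}} is exactly why the rate limit forces the 30-day refresh window described above.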
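The backup step in limitation 4 could be driven by a cronjob calling a small script like the sketch below. Everything here is a placeholder: the actual dump command depends on the database behind the archive (not specified in this comment), the paths and retention period are made up, and the uploading part is left out.

```python
import datetime
import pathlib
import subprocess

# Placeholders: the real one-command dump depends on the archive's database.
DUMP_CMD = ["mongodump", "--archive"]  # hypothetical dump command
DUMP_DIR = pathlib.Path("/var/backups/slack-archive")
KEEP_DAYS = 14  # hypothetical retention period


def dump_database(dump_dir: pathlib.Path = DUMP_DIR) -> pathlib.Path:
    """Run the one-command dump into a timestamped file (service keeps running)."""
    dump_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    out = dump_dir / f"dump-{stamp}.archive"
    with out.open("wb") as f:
        subprocess.run(DUMP_CMD, stdout=f, check=True)
    return out


def prune_old_dumps(dump_dir: pathlib.Path,
                    keep_days: int = KEEP_DAYS) -> list[pathlib.Path]:
    """The "cleaning" part: delete dumps older than the retention period."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=keep_days)
    removed = []
    for path in sorted(dump_dir.glob("dump-*.archive")):
        stamp = path.stem.removeprefix("dump-")
        if datetime.datetime.strptime(stamp, "%Y%m%dT%H%M%SZ") < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A crontab entry running this hourly or daily would cover the "trigger" part; uploading the returned dump file to external storage would complete it.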
Some numbers, FYI:
# [Slack Analytics|https://apache-flink.slack.com/admin/stats] shows we now
have 9.1k messages in total. In the last 30 days, only 31% of messages were
sent in public channels, vs. 67% in DMs and 1% in private channels.
# The Slack archive captures public-channel messages only. It captures 2.5k
messages in total, taking about 7~8 minutes on my laptop. The bottleneck is
Slack's API rate limit.
# A full dump of the database, containing all 2.5k messages plus the channel &
user information, completes almost instantly. The dumped file is 3.7 MB.
I'll try to deploy the service next. Based on the numbers above, I think a
dedicated VM might not be necessary, so I'd try the flink-packages host first.
BTW, I have already backed up a dump of all public messages so far, so it
shouldn't be a problem if the service is not deployed by the time the 10k limit
is reached.
> Slack: set up archive
> ---------------------
>
> Key: FLINK-27721
> URL: https://issues.apache.org/jira/browse/FLINK-27721
> Project: Flink
> Issue Type: Sub-task
> Reporter: Xintong Song
> Assignee: Xintong Song
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)