[Updating only the Analytics list]

Hi everybody,

I forgot to update this email thread last week. The Event Logging master
database switch went fine but, as reported, the maintenance window affected
the graphs in the Eventlogging Schema dashboard. For example, this is what
the Popups schema looked like:

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&from=1510704000000&to=1510790399000&var-schema=Popups

The gaps are not related to data loss in MySQL or to data inconsistency:
those graphs only show Kafka throughput metrics. A quick refresher on how
the events flow:

Browser --> Varnish cache layer (text/upload) --> Varnishkafka (running on
the caching hosts) --> Kafka cluster <--> Eventlogging --> MySQL
databases (Eventlogging Master)

I completely stopped the Eventlogging service while switching the master
database, so its Kafka consumer metrics dropped to zero (and spiked when EL
was started again). The event timestamps are set by Varnishkafka, so this
action did not affect the final data quality.
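
To make the point above concrete, here is a minimal toy sketch in Python. It
is not the actual Eventlogging or Kafka code: the function name, the
parameters and the in-memory list standing in for a Kafka topic are all
hypothetical, and it assumes the consumer resumes from the position where it
stopped. It only shows why a stopped consumer produces a gap and then a
spike in a throughput graph while the stored events keep their original
producer-side timestamps.

```python
from collections import Counter


def simulate(total_minutes=10, events_per_minute=5, down=range(4, 7)):
    """Toy model of a producer, a log and a consumer (all hypothetical).

    The producer appends timestamped events every minute. The consumer
    reads everything after its last offset, unless it is stopped.
    """
    log = []                  # stand-in for a Kafka topic
    offset = 0                # consumer position in the log
    stored = []               # stand-in for rows written to the database
    consumed_per_minute = []  # what a throughput graph would show

    for minute in range(total_minutes):
        # The timestamp is assigned on the producer side, at produce time.
        for i in range(events_per_minute):
            log.append((minute, f"event-{minute}-{i}"))

        if minute in down:
            # Consumer stopped: nothing is read, the metric drops to zero.
            consumed_per_minute.append(0)
            continue

        # Consumer running: read the backlog from the last offset onwards.
        batch = log[offset:]
        offset = len(log)
        stored.extend(batch)
        consumed_per_minute.append(len(batch))

    return log, stored, consumed_per_minute


log, stored, consumed_per_minute = simulate()
events_per_timestamp = Counter(ts for ts, _ in stored)

print("consumer throughput per minute:", consumed_per_minute)
print("stored events per producer timestamp:", dict(events_per_timestamp))
print("nothing lost:", stored == log)
```

With the default values the throughput series is 5, 5, 5, 5, 0, 0, 0, 20,
5, 5 (a gap followed by a spike), while every producer timestamp still has
exactly 5 stored events.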

This maintenance raised a few questions in
https://phabricator.wikimedia.org/T179914#3764603; apologies for the
trouble and the time wasted :(

The good news is that the master database was switched without any data
loss and we are now using a more powerful host!

Thanks!

Luca

2017-11-14 18:59 GMT+01:00 Luca Toscano <ltosc...@wikimedia.org>:

> Hi everybody,
>
> the Analytics team needs to do the following maintenance operations:
>
> 1) migrate the Event-Logging master db ('log', currently on db1046) to the
> new host db1107 (T156844). This should happen on *Wed Nov 15th (EU
> morning)*, and it should be transparent to all the Event Logging users.
> The only drawback that might be observed is a delay in getting the latest
> records on the analytics db replicas (db1108, db1047, dbstore1002).
>
> 2) Reboot thorium and all the stat boxes for Linux kernel updates.
>
> - Thorium hosts all the analytics websites like pivot.wikimedia.org,
> yarn.wikimedia.org, analytics.wikimedia.org, etc., and will be rebooted
> on *Wed Nov 15th (EU morning)*. The websites' downtime should be minimal
> (range of minutes).
> - stat boxes (stat1004, stat1005, stat1006) are usually running a lot of
> screen/tmux sessions with various data crunching activities, so I'll try to
> follow up with all the users currently running something on them to verify
> if I can proceed or not. I'd tentatively schedule the reboots on *Thu Nov
> 16th (EU morning)*, but please follow up with me asap if this needs to be
> postponed.
>
> Thanks in advance and sorry for the trouble!
>
> Luca (on behalf of the Analytics team)
>
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
