Please take a look at the preliminary outage report (with pretty
pictures!). TL;DR: Kafka had a small outage and EventLogging is not
resilient enough to deal with those; the restart that Ori did brought
EventLogging back up. We have measures in place to deal with SQL insertion
after an event like this one, but at this time we need to verify that the
SQL insertion has caught up with its backlog.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20151127-EventLogging

On Fri, Nov 27, 2015 at 8:35 AM, Nuria Ruiz <[email protected]> wrote:

> > Unfortunately, the only team-members working full-time yesterday and
> > today are we Europe folks.
> > We weren't there when that happened and we don't get those alerts on the
> > phone, we should though.
> Given that this system is tier-2, I do not think we need an immediate
> response; 24 hours should be an acceptable ETA. I would say even 48.
>
> On Fri, Nov 27, 2015 at 2:31 AM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> Thanks, Ori, for having a look at this and restarting EL.
>>
>> I understand it was 01:30 UTC on Friday (today), not Thursday. It went
>> on for 5-6 hours.
>> Unfortunately, the only team-members working full-time yesterday and
>> today are we Europe folks.
>> We weren't there when that happened and we don't get those alerts on the
>> phone, we should though.
>>
>> This problem already happened about a month ago. We'll backfill the
>> missing events and will investigate.
>> Thanks again for the heads-up.
>>
>> On Fri, Nov 27, 2015 at 8:01 AM, Ori Livneh <[email protected]> wrote:
>>
>>> On Thu, Nov 26, 2015 at 10:46 PM, Ori Livneh <[email protected]> wrote:
>>>
>>>> Seems that eventlog1001 has not received any events since 01:30 UTC on
>>>> Thursday
>>>>
>>>>
>>>> http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Miscellaneous+eqiad&h=eventlog1001.eqiad.wmnet&jr=&js=&event=hide&ts=0&v=140128.28&m=bytes_in&vl=bytes%2Fsec&ti=Bytes+Received
>>>>
>>>> This is pretty severe; I'd page if it wasn't a US holiday.
>>>>
>>>
>>> Kafka clients on eventlog1001 were in an "Autocommitting consumer offset"
>>> death-loop and not receiving any events from the Kafka brokers. I ran
>>> eventloggingctl stop / eventloggingctl start and they recovered. Needs to
>>> be investigated more thoroughly. Otto, can you follow up?
>>>
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>>
>> --
>> *Marcel Ruiz Forns*
>> Analytics Developer
>> Wikimedia Foundation
>>
>>
>>
>
