> This sounds like the fixes we did last quarter to the batch insertion
> basically hid the problem instead of making it go away.
I think we are mixing things up here: when we had issues with the
batching code, we never saw a pattern of "no events whatsoever in any
table for an hour". We saw events dropped in bursts here and there,
but certainly not an hour-long blackout.

Also, no events were dropped when we did the major backfilling in
early March, when the db sustained quite a bit of load because we had
to insert those events one by one.

So (while I am not saying we could not uncover a code issue on our
end) we have not seen this particular error pattern before.
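
For what it's worth, the two patterns are easy to tell apart directly
from the tables. A rough sketch of the check (the connection string,
table name, and YYYYMMDDHHMMSS timestamp format are assumptions on my
side):

    import sqlalchemy as sa
    from datetime import datetime, timedelta

    # Illustrative connection string and table name, not the real EL ones.
    engine = sa.create_engine("mysql+pymysql://user:pass@el-slave/log")

    # Bucket events by hour, assuming YYYYMMDDHHMMSS event timestamps.
    query = sa.text("""
        SELECT LEFT(timestamp, 10) AS hour, COUNT(*) AS events
        FROM SomeSchema_12345678
        GROUP BY hour ORDER BY hour
    """)

    with engine.connect() as conn:
        hours = [h.decode() if isinstance(h, bytes) else h
                 for h, _count in conn.execute(query)]

    # An hour absent from the result set entirely is a blackout; burst
    # drops would instead show up as hours with unusually low counts.
    fmt = "%Y%m%d%H"
    for prev, cur in zip(hours, hours[1:]):
        gap = datetime.strptime(cur, fmt) - datetime.strptime(prev, fmt)
        if gap > timedelta(hours=1):
            print("no events at all between", prev, "and", cur)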

On Wed, Apr 15, 2015 at 3:00 PM, Dan Andreescu <[email protected]>
wrote:

> This sounds like the fixes we did last quarter to the batch insertion
> basically hid the problem instead of making it go away.
>
> On Wed, Apr 15, 2015 at 5:58 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> Hi Sean,
>>
>>
>>> *However*, the consumer logs indicate the insert timestamp only, not the
>>> event timestamp (which goes to the table). So it could be that there's some
>>> data loss inside the consumer code (or in zmq?) that wouldn't stop the
>>> write flow, but would skip a segment of events. I'll look deeper into this.
>>
>>
>> We've deployed an improvement to the EL mysql consumer logs, to make
>> sure that the events being inserted at the time of the DB gaps did
>> indeed correspond to the missing data. And we've seen that the answer
>> is yes: the consumer executes the missing inserts on time and without
>> errors from the sqlalchemy client.
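>>
>> For context, a minimal sketch of what the extra logging looks like
>> (the names are illustrative and this is not the actual consumer
>> code; it assumes a sqlalchemy Table and dict-shaped events):
>>
>>     import logging
>>
>>     log = logging.getLogger("eventlogging.mysql-consumer")
>>
>>     def insert_event(conn, table, event):
>>         # Log the event's own timestamp and uuid alongside the
>>         # wall-clock insert time, so that log lines can be matched
>>         # against the gaps in the tables.
>>         log.info("inserting uuid=%s event_ts=%s into %s",
>>                  event["uuid"], event["timestamp"], table.name)
>>         conn.execute(table.insert().values(**event))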
>>
>> Can you supply some specific records from the EL logs with timestamps
>>> that should definitely be in the database, so we can scan the database
>>> binlog for specific UUIDs or suchlike?
>>
>>
>> Here are three valid events that were apparently inserted correctly, but
>> don't appear in the db.
>> http://pastebin.com/8wm6qkkE
>> (they are performance events and contain no sensitive data)
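>>
>> In case it saves you a step, the scan we have in mind is roughly the
>> following (the binlog path is a guess on our side, and the uuid
>> placeholders stand for the ones in the pastebin):
>>
>>     import subprocess
>>
>>     # Hypothetical binlog file name; the real ones live on the master.
>>     BINLOG = "/srv/sqldata/log-bin.000123"
>>     UUIDS = ["<uuid-1-from-the-pastebin>", "<uuid-2>", "<uuid-3>"]
>>
>>     # mysqlbinlog decodes the binary log to SQL text on stdout.
>>     dump = subprocess.check_output(
>>         ["mysqlbinlog", BINLOG]).decode(errors="replace")
>>
>>     for uuid in UUIDS:
>>         status = "present in" if uuid in dump else "MISSING from"
>>         print(uuid, status, "the binlog")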
>>
>> -- can you give me some idea of how long your "at other moments" delay
>>> is?
>>
>>
>> I followed the master-slave replication lag for some hours and
>> noticed a pattern: the lag grows progressively, by roughly 10 minutes
>> per hour, until it reaches 1 to 2 hours. At that point, the data gap
>> happens and the replication lag drops back to a few minutes. I could
>> only catch a data gap "live" 2 times, so this is definitely not
>> conclusive, but there's the hypothesis that the two problems are
>> related.
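>>
>> For reference, the lag numbers above come from polling the slave
>> roughly like this (host, credentials, and interval are illustrative):
>>
>>     import time
>>     import pymysql
>>
>>     conn = pymysql.connect(host="el-slave", user="watch", password="...")
>>
>>     while True:
>>         with conn.cursor(pymysql.cursors.DictCursor) as cur:
>>             cur.execute("SHOW SLAVE STATUS")
>>             status = cur.fetchone()
>>         # Seconds_Behind_Master creeping up ~10 min per hour and then
>>         # snapping back to near zero is the pattern described above.
>>         print(time.strftime("%H:%M:%S"), status["Seconds_Behind_Master"])
>>         time.sleep(60)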
>>
>> Sean, I hope that helps answer your questions.
>> Let us know if you have any idea on this.
>>
>> Thank you!
>>
>> Marcel
>>
>> On Tue, Apr 14, 2015 at 9:15 PM, Marcel Ruiz Forns <[email protected]>
>> wrote:
>>
>>> Sean, thanks for the quick response:
>>>
>>>
>>>> We have a binary log on the EL master that holds the last week of
>>>> INSERT statements. It can be dumped and grepped, eg looking at
>>>> 10-minute blocks around 2015-04-13 16:30:
>>>>
>>>
>>> Good to know!
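>>>
>>> For my own notes, I take it the check was roughly the following
>>> (the binlog file name is a guess on my part):
>>>
>>>     import subprocess
>>>
>>>     # Decode the binlog and count INSERTs whose event timestamps
>>>     # fall in the 10 minutes after 2015-04-13 16:30.
>>>     dump = subprocess.check_output(
>>>         ["mysqlbinlog", "/srv/sqldata/log-bin.000123"])
>>>     lines = dump.decode(errors="replace").splitlines()
>>>     hits = sum(1 for line in lines
>>>                if "INSERT" in line and "20150413163" in line)
>>>     print(hits, "INSERTs with 20150413163* timestamps")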
>>>
>>> Zero(!) during 10min after 16:30 doesn't look good. This means the
>>>> database master did not see any INSERTs with 20150413163* timestamps
>>>> within the last week.
>>>
>>>
>>> Ok, makes more sense.
>>>
>>>
>>>> Can you describe how you know that events were
>>>> written normally? Simply a lack of errors from mysql consumer?
>>>>
>>>
>>> The MySQL consumer log not only lacks errors, it has records of
>>> successful writes to the db at the time of the problems. Also, the
>>> processor logs indicate homogeneous throughput of valid events at
>>> all times.
>>>
>>> *However*, the consumer logs indicate the insert timestamp only, not the
>>> event timestamp (which goes to the table). So it could be that there's some
>>> data loss inside the consumer code (or in zmq?) that wouldn't stop the
>>> write flow, but would skip a segment of events. I'll look deeper into this.
>>>
>>>
>>>> Can you supply some specific records from the EL logs with timestamps
>>>> that should definitely be in the database, so we can scan the database
>>>> binlog for specific UUIDs or suchlike?
>>>>
>>>
>>> I'll try to get those.
>>>
>>> -- can you give me some idea of how long your "at other moments" delay
>>>> is?
>>>>
>>>
>>> I'll observe the db during the day and give you an estimate.
>>>
>>> Thanks!
>>>
>>>
>>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
