This sounds like the fixes we made to batch insertion last quarter
basically hid the problem instead of making it go away.

On Wed, Apr 15, 2015 at 5:58 PM, Marcel Ruiz Forns <[email protected]>
wrote:

> Hi Sean,
>
>
>> *However*, the consumer logs indicate the insert timestamp only, not the
>> event timestamp (which goes to the table). So it could be that there's some
>> data loss inside the consumer code (or in zmq?) that wouldn't stop the
>> write flow, but would skip a segment of events. I'll look deeper into this.
>
>
> We've deployed an improvement to the EL MySQL consumer logs to verify
> that the events being inserted at the time of the DB gaps indeed
> correspond to the missing data. And the answer is yes: the consumer
> executes the missing inserts on time and without errors from the
> sqlalchemy client.
>
> Can you supply some specific records from the EL logs with timestamps
>> that should definitely be in the database, so we can scan the database
>> binlog for specific UUIDs or suchlike?
>
>
> Here are three valid events that were apparently inserted correctly, but
> don't appear in the db.
> http://pastebin.com/8wm6qkkE
> (they are performance events and contain no sensitive data)
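In case a concrete command helps, here is a minimal sketch of the kind of binlog scan Sean describes; the binlog path, schema name, and UUID below are placeholders, not values from the real setup:

```shell
#!/bin/sh
# Sketch: look for a specific event UUID in the decoded master binlog.
# Path, schema name, and UUID are placeholders. mysqlbinlog decodes row
# events into pseudo-SQL text that plain grep can search:
#
#   mysqlbinlog --base64-output=DECODE-ROWS --verbose mysql-bin.000123 \
#     | grep -c "$UUID"
#
# The grep itself works on any decoded dump; shown here on a sample line:
UUID="05e3c04a9d0a5a31a2f5c2b8a1d4e7f9"
printf "INSERT INTO log.SomeSchema_123 SET uuid='%s'\n" "$UUID" \
  | grep -c "$UUID"
```

A count of 0 for a UUID that appears in the consumer logs would confirm the event never reached the master.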
>
> -- can you give me some idea of how long your "at other moments" delay
>> is?
>
>
> I followed the master-slave replication lag for some hours and noticed a
> pattern: the lag grows progressively over time, at roughly 10 minutes per
> hour, until it reaches 1 to 2 hours. At that point the data gap happens
> and the replication lag drops back to a few minutes. I only caught a data
> gap "live" twice, so that's definitely not a conclusive statement. But
> there's a hypothesis that the two problems are related.
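One standard way to watch that lag is the Seconds_Behind_Master field of SHOW SLAVE STATUS. A minimal sketch of extracting it, with a canned server response standing in for real `mysql -e 'SHOW SLAVE STATUS\G'` output (the value 4231 is made up for illustration):

```shell
#!/bin/sh
# Sketch: pull Seconds_Behind_Master out of SHOW SLAVE STATUS\G output.
# In production the input would come from:
#   mysql -h <slave-host> -e 'SHOW SLAVE STATUS\G'
# A canned sample stands in for the server response here:
sample='       Slave_IO_Running: Yes
      Slave_SQL_Running: Yes
Seconds_Behind_Master: 4231'
echo "$sample" | awk '/Seconds_Behind_Master/ {print $2}'
```

Logging that value every few minutes (from cron, say) would make the roughly 10-minutes-per-hour growth easy to chart and to correlate with the gap timestamps.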
>
> Sean, I hope that helps answer your questions.
> Let us know if you have any idea on this.
>
> Thank you!
>
> Marcel
>
> On Tue, Apr 14, 2015 at 9:15 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> Sean, thanks for the quick response:
>>
>>
>>> We have a binary log on the EL master that holds the last week of
>>> INSERT statements. It can be dumped and grepped, eg looking at
>>> 10-minute blocks around 2015-04-13 16:30:
>>>
>>
>> Good to know!
>>
>> Zero(!) during 10min after 16:30 doesn't look good. This means the
>>> database master did not see any INSERTs with 20150413163* timestamps
>>> within the last week.
>>
>>
>> Ok, makes more sense.
>>
>>
>>> Can you describe how you know that events were
>>> written normally? Simply a lack of errors from mysql consumer?
>>>
>>
>> The MySQL consumer log not only lacks errors, it also records successful
>> writes to the db at the time of the problems. Also, the processor logs
>> indicate a homogeneous throughput of valid events at all times.
>>
>> *However*, the consumer logs indicate the insert timestamp only, not the
>> event timestamp (which goes to the table). So it could be that there's some
>> data loss inside the consumer code (or in zmq?) that wouldn't stop the
>> write flow, but would skip a segment of events. I'll look deeper into this.
>>
>>
>>> Can you supply some specific records from the EL logs with timestamps
>>> that should definitely be in the database, so we can scan the database
>>> binlog for specific UUIDs or suchlike?
>>>
>>
>> I'll try to get those.
>>
>> -- can you give me some idea of how long your "at other moments" delay
>>> is?
>>>
>>
>> I'll observe the db during the day and give you an estimate.
>>
>> Thanks!
>>
>>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>