This sounds like the batch-insertion fixes we did last quarter basically hid the problem rather than making it go away.
On Wed, Apr 15, 2015 at 5:58 PM, Marcel Ruiz Forns <[email protected]> wrote:

> Hi Sean,
>
>> *However*, the consumer logs indicate the insert timestamp only, not the
>> event timestamp (which goes to the table). So it could be that there's
>> some data loss inside the consumer code (or in zmq?) that wouldn't stop
>> the write flow, but would skip a segment of events. I'll look deeper
>> into this.
>
> We've deployed an improvement on the EL mysql consumer logs, to be sure
> that the events being inserted at the time of the DB gaps corresponded
> indeed with the missing data. And we've seen that the answer is yes: the
> consumer executes the missing inserts on time and without errors in the
> sqlalchemy client.
>
>> Can you supply some specific records from the EL logs with timestamps
>> that should definitely be in the database, so we can scan the database
>> binlog for specific UUIDs or suchlike?
>
> Here are three valid events that were apparently inserted correctly, but
> don't appear in the db: http://pastebin.com/8wm6qkkE
> (they are performance events and contain no sensitive data)
>
>> -- can you give me some idea of how long your "at other moments" delay
>> is?
>
> I followed the master-slave replication lag for some hours and noticed a
> pattern: the lag grows progressively over time, roughly 10 minutes per
> hour, until it reaches 1 to 2 hours. At that point, the data gap happens
> and the replication lag drops back to a few minutes. I could only catch
> a data gap "live" twice, so that's definitely not a conclusive
> statement. But there's this hypothesis that the two problems are
> related.
>
> Sean, I hope that helps answer your questions.
> Let us know if you have any ideas on this.
>
> Thank you!
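The lag pattern Marcel describes (a slow climb of roughly 10 minutes per hour up to 1-2 hours, then an abrupt drop back to a few minutes right when the data gap appears) should be detectable programmatically rather than by watching the slave by hand. Here is a minimal sketch of that idea; it assumes you are already polling replication lag (e.g. Seconds_Behind_Master) and collecting (timestamp, lag_seconds) samples, and the function name and thresholds are hypothetical, not anything currently deployed:

```python
# Sketch: flag the moments where replication lag suddenly collapses after
# a long climb -- per the thread, those drops coincide with the data gaps.
# How the samples are collected (e.g. polling SHOW SLAVE STATUS) is out of
# scope; this only analyzes a list of (timestamp, lag_seconds) pairs.

def find_lag_resets(samples, drop_fraction=0.8, min_peak_lag=3600):
    """Return timestamps where lag dropped by more than drop_fraction
    from a peak of at least min_peak_lag seconds (1-2 h, as observed)."""
    resets = []
    peak = 0
    for ts, lag in samples:
        peak = max(peak, lag)
        if peak >= min_peak_lag and lag <= peak * (1 - drop_fraction):
            resets.append(ts)   # candidate gap window starts here
            peak = lag          # start tracking the next climb
    return resets

# Toy example: lag climbs steadily, then resets from 90 min to 5 min.
samples = [(0, 600), (1, 1800), (2, 3600), (3, 5400), (4, 300), (5, 900)]
print(find_lag_resets(samples))  # -> [4]
```

Logging the flagged timestamps alongside the consumer's insert log would make it possible to confirm (or refute) the hypothesis that the lag resets and the gaps are the same event.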
> Marcel
>
> On Tue, Apr 14, 2015 at 9:15 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> Sean, thanks for the quick response:
>>
>>> We have a binary log on the EL master that holds the last week of
>>> INSERT statements. It can be dumped and grepped, eg looking at
>>> 10-minute blocks around 2015-04-13 16:30:
>>
>> Good to know!
>>
>>> Zero(!) during 10min after 16:30 doesn't look good. This means the
>>> database master did not see any INSERTs with 20150413163* timestamps
>>> within the last week.
>>
>> Ok, makes more sense.
>>
>>> Can you describe how you know that events were written normally?
>>> Simply a lack of errors from mysql consumer?
>>
>> The MySQL consumer log not only lacks errors, but has records of
>> successful writes to the db at the time of the problems. Also, the
>> processor logs indicate homogeneous throughput of valid events at all
>> times.
>>
>> *However*, the consumer logs indicate the insert timestamp only, not
>> the event timestamp (which goes to the table). So it could be that
>> there's some data loss inside the consumer code (or in zmq?) that
>> wouldn't stop the write flow, but would skip a segment of events. I'll
>> look deeper into this.
>>
>>> Can you supply some specific records from the EL logs with timestamps
>>> that should definitely be in the database, so we can scan the database
>>> binlog for specific UUIDs or suchlike?
>>
>> I'll try to get those.
>>
>>> -- can you give me some idea of how long your "at other moments" delay
>>> is?
>>
>> I'll observe the db during the day and give you an estimate.
>>
>> Thanks!
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
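Sean's suggestion of scanning the binlog for specific UUIDs can be scripted once the binlog has been dumped to text (e.g. with mysqlbinlog). A rough sketch of the cross-check follows; it assumes EL event UUIDs appear in the INSERT statements as 32-character hex strings, and the function name and regex are illustrative, so adjust them to the actual dump format:

```python
# Sketch: given a text dump of the master's binlog and the set of UUIDs
# that the EL consumer logs claim were inserted, report which UUIDs never
# appear in the dump. The 32-hex-char UUID format is an assumption about
# the INSERT statements; tweak the regex to match the real dump.
import re

def missing_uuids(binlog_text, expected_uuids):
    """Return, sorted, the expected UUIDs absent from the binlog dump."""
    found = set(re.findall(r"[0-9a-f]{32}", binlog_text))
    return sorted(u for u in expected_uuids if u not in found)

# Toy example with two fake UUIDs: one present in the dump, one absent.
dump = ("INSERT INTO log.SomeSchema_123 (uuid, ...) "
        "VALUES ('" + "a" * 32 + "', ...);")
print(missing_uuids(dump, {"a" * 32, "b" * 32}))
# -> a one-element list containing the all-'b' UUID
```

Running this over the three pastebin events would tell us directly whether the INSERTs ever reached the master, or were lost upstream in the consumer or zmq.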
