> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored.

Aye, Jeff mentioned maybe doing that. Not a bad idea.
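A minimal sketch of the canary idea, assuming the kafka-python client and a hypothetical broker list and canary topic (none of which is necessarily what fundraising runs): produce a uniquely-tagged message, then wait for it to come out the consumer end. As Toby notes below, the round trip doubles as a delivery-latency measurement.

    #!/usr/bin/env python
    # Canary sketch: send a uniquely-tagged message through Kafka and watch
    # for it to arrive. kafka-python client; brokers and topic hypothetical.
    import json
    import time
    import uuid

    from kafka import KafkaConsumer, KafkaProducer

    BROKERS = ['localhost:9092']    # hypothetical broker list
    TOPIC = 'webrequest_canary'     # hypothetical topic for canary traffic

    def send_and_await_canary(timeout_s=60):
        token = str(uuid.uuid4())

        # Attach the consumer first, so it is positioned at the end of the
        # topic before the canary is produced (avoids racing 'latest').
        consumer = KafkaConsumer(
            TOPIC,
            bootstrap_servers=BROKERS,
            auto_offset_reset='latest',
            consumer_timeout_ms=timeout_s * 1000,
        )
        consumer.poll(timeout_ms=1000)

        sent_at = time.time()
        producer = KafkaProducer(bootstrap_servers=BROKERS)
        producer.send(TOPIC, json.dumps({'canary': token}).encode('utf-8'))
        producer.flush()

        deadline = sent_at + timeout_s
        for message in consumer:
            try:
                record = json.loads(message.value.decode('utf-8'))
            except ValueError:
                record = {}
            if record.get('canary') == token:
                return time.time() - sent_at   # doubles as delivery latency
            if time.time() > deadline:
                break
        return None   # canary never arrived within the timeout: alarm

    if __name__ == '__main__':
        latency = send_and_await_canary()
        print('canary LOST' if latency is None
              else 'delivered in %.2fs' % latency)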
Jeff, aye, you are right. You wouldn’t be able to run the sequence number
check on your saved data. Sorry, I forgot that it wasn’t just the full
webrequest_text. You’d have to run another kafkatee output pipe then, to
check unsampled sequence numbers, similar to how the packet-loss.cpp
script worked with udp2log.

On Fri, Jul 8, 2016 at 11:05 AM, Toby Negrin <[email protected]> wrote:

> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored. This
> might be a bit easier to set up.
>
> It's been effective on other systems I've worked on; it's also a good
> way to measure delivery latency.
>
> -Toby
>
> On Friday, July 8, 2016, Jeff Green <[email protected]> wrote:
>
>> On Fri, 8 Jul 2016, Andrew Otto wrote:
>>
>>> Well, you won’t be able to do it exactly how we do, since we are
>>> loading the data into Hadoop and then checking it there, so we use
>>> Hadoop tools. Here’s what we’ve got:
>>>
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
>>>
>>> This old udp2log tool did a similar thing, so it is worth knowing about:
>>> https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
>>> However, it only worked with TSV udp2logs, and I think it won’t work
>>> with a multi-partition Kafka topic, since seqs could be out of order
>>> based on partition read order.
>>>
>>> You guys do some kind of 15 (10?) minute roll-ups, right? You could
>>> probably make some very rough guesses about data loss in each
>>> 15-minute bucket. You’d have to be careful though, since the order of
>>> the data is not guaranteed. We have the luxury of being able to query
>>> over our hourly buckets and assuming that all (most, really) of the
>>> data belongs in that hour bucket. But we use Camus to read from Kafka,
>>> which handles the time bucket sorting for us.
>>
>> Yep, the pipeline is kafkatee->udp2log->files rotated on a 15 min
>> interval, and parser-script->mysql, which runs on a separate system.
>>
>> Since the log files are stored, one option would be a script that
>> merges several files into a longer-period sample, then sorts and checks
>> for sequence gaps. Another option would be to modify the parse-to-mysql
>> script to do the same thing.
>>
>> But the part I don't get yet is how a script looking at output logs
>> would identify a problematic gap in sequence numbers. We have two
>> collectors: one is 1:1 and the other is sampled 1:10, and both filter
>> on the GET string. So if my understanding of the sequence numbers is
>> correct (they're per-proxy, right?), we should see only a small sample
>> of sequence numbers, and how that sample relates to overall traffic
>> will vary greatly depending on the fundraising campaign and whatever
>> else is going on on the site.
>>
>> jg
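A minimal sketch of the merge-and-check option Jeff describes, with the rough-guess caveats Andrew raises: the hostname and sequence-number column positions are assumptions, and because both collectors filter on the GET string, the estimate below is a coarse signal rather than a real loss measurement.

    #!/usr/bin/env python
    # Sketch of the merge-and-check option: read several rotated TSV files,
    # group sequence numbers per host, and make a rough loss estimate.
    # Column positions and the 1:10 sample rate are assumptions.
    import sys
    from collections import defaultdict

    HOST_FIELD = 0     # assumed: hostname in the first TSV column
    SEQ_FIELD = 1      # assumed: per-host sequence number in the second
    SAMPLE_RATE = 10   # the 1:10 collector; use 1 for the unsampled one

    def estimate_loss(paths):
        seqs_by_host = defaultdict(list)
        for path in paths:
            with open(path) as f:
                for line in f:
                    fields = line.rstrip('\n').split('\t')
                    try:
                        seqs_by_host[fields[HOST_FIELD]].append(
                            int(fields[SEQ_FIELD]))
                    except (IndexError, ValueError):
                        continue   # skip malformed lines

        for host, seqs in sorted(seqs_by_host.items()):
            seqs.sort()   # order is not guaranteed across files/partitions
            expected = (seqs[-1] - seqs[0] + 1) / float(SAMPLE_RATE)
            loss = max(0.0, 1.0 - len(seqs) / expected)
            print('%s\tseen=%d\texpected~%.0f\tloss~%.1f%%'
                  % (host, len(seqs), expected, loss * 100))

    if __name__ == '__main__':
        estimate_loss(sys.argv[1:])

Setting SAMPLE_RATE = 1 reduces this to a plain contiguity check for the unsampled stream, with the same filtering caveat.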
>>> Happy to chat more here or on IRC. :)
>>>
>>> On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:
>>>
>>> Hi Nuria, thanks for raising the issue. Could you point me to the
>>> script you're using for sequence checks? I'm definitely interested in
>>> looking at how we might integrate that into fundraising monitoring.
>>>
>>> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>>>
>>> (cc-ing analytics public list)
>>>
>>> Fundraising folks:
>>>
>>> We were talking about the problems we have had with clickstream data
>>> and kafka as of late, and how to prevent issues like this one going
>>> forward: https://phabricator.wikimedia.org/T132500
>>>
>>> We think you guys could benefit from setting up the same set of alarms
>>> on data integrity that we have on the webrequest end, and we will be
>>> happy to help with that at your convenience.
>>>
>>> An example of how these alarms could work (simplified version): every
>>> message that comes from kafka has a sequence id; if sorted, those
>>> sequence ids should be more or less contiguous, and a gap in sequence
>>> ids indicates data loss at the kafka source. A script checks the
>>> sequence ids against the number of records and triggers an alarm if
>>> the two do not match.
>>>
>>> Let us know if you want to proceed with this work.
>>>
>>> Thanks,
>>>
>>> Nuria
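The simplified alarm Nuria describes can be sketched in a few lines; the per-host comparison of sequence-id range against record count is roughly the idea the hql jobs linked above implement. The 2% threshold and the (hostname, sequence_id) input shape are illustrative assumptions:

    # Sketch of the simplified alarm: per host, compare the sequence-id
    # range against the record count; threshold and input shape are
    # illustrative only.
    from collections import defaultdict

    def sequence_alarms(records, threshold=0.02):
        """records: iterable of (hostname, sequence_id) pairs.
        Returns {host: apparent_loss_ratio} for hosts over threshold."""
        lo, hi, count = {}, {}, defaultdict(int)
        for host, seq in records:
            lo[host] = seq if host not in lo else min(lo[host], seq)
            hi[host] = seq if host not in hi else max(hi[host], seq)
            count[host] += 1

        alarms = {}
        for host in count:
            expected = hi[host] - lo[host] + 1   # contiguous ids cover this
            loss = (expected - count[host]) / float(expected)
            if loss > threshold:
                alarms[host] = loss
        return alarms

    # e.g. sequence_alarms([('cp1001', 1), ('cp1001', 2), ('cp1001', 9)])
    #      -> {'cp1001': 0.666...} -> time to page someone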
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
