Another approach we discussed back in the day was setting up a canary script to send known-good messages whose delivery is monitored. This might be a bit easier to set up; it's been effective on other systems I've worked on, and it's also a good way to measure delivery latency.
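A minimal sketch of that kind of canary, using the kafka-python client; the topic name, broker list, and timeout are placeholders, not anything from the actual setup:

    # Canary sketch: produce one tagged message, confirm it arrives, and
    # measure delivery latency. Topic/broker names are placeholders.
    import json
    import time
    import uuid

    from kafka import KafkaConsumer, KafkaProducer

    BROKERS = ['kafka1001:9092']     # placeholder broker list
    TOPIC = 'fundraising.canary'     # placeholder topic name
    TIMEOUT_MS = 30000               # alarm if not delivered within 30s

    # Subscribe before producing so the consumer starts at the current
    # end of the topic and cannot miss the canary.
    consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                             auto_offset_reset='latest',
                             consumer_timeout_ms=TIMEOUT_MS)
    consumer.poll(timeout_ms=0)      # force partition assignment now

    producer = KafkaProducer(bootstrap_servers=BROKERS)
    token = str(uuid.uuid4())
    sent_at = time.time()
    producer.send(TOPIC, json.dumps({'canary': token}).encode('utf-8'))
    producer.flush()

    for msg in consumer:             # iteration ends after TIMEOUT_MS idle
        if json.loads(msg.value).get('canary') == token:
            print('canary delivered in %.3fs' % (time.time() - sent_at))
            break
    else:
        print('ALARM: canary not delivered within %d ms' % TIMEOUT_MS)

Run it from cron and wire the ALARM line into whatever pages you; the measured round-trip doubles as the delivery-latency metric mentioned above.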
-Toby

On Friday, July 8, 2016, Jeff Green <[email protected]> wrote:

> On Fri, 8 Jul 2016, Andrew Otto wrote:
>
>> Well, you won't be able to do it exactly how we do it, since we are
>> loading the data into Hadoop and then checking it there, so we use
>> Hadoop tools. Here's what we've got:
>>
>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
>>
>> This old udp2log tool did a similar thing, so it is worth knowing about:
>> https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
>> However, it only worked with TSV udp2logs, and I think it won't work
>> with a multi-partition Kafka topic, since seqs could be out of order
>> based on partition read order.
>>
>> You guys do some kind of 15 (10?) minute roll-ups, right? You could
>> probably make some very rough guesses at data loss in each 15-minute
>> bucket. You'd have to be careful though, since the order of the data
>> is not guaranteed. We have the luxury of being able to query over our
>> hourly buckets and assuming that all (most, really) of the data
>> belongs in that hour bucket. But we use Camus to read from Kafka,
>> which handles the time bucket sorting for us.
>
> Yep, the pipeline is kafkatee->udp2log->files rotated on a 15-minute
> interval, and parser-script->mysql, which runs on a separate system.
>
> Since the log files are stored, one option would be a script that
> merges several files into a longer sample period, then sorts and
> checks for sequence gaps. Another option would be to modify the
> parse-to-mysql script to do the same thing.
>
> But the part I don't get yet is how a script looking at output logs
> would identify a problematic gap in sequence numbers. We have two
> collectors: one is 1:1 and the other is sampled 1:10, and both filter
> on the GET string. So if my understanding of the sequence numbers is
> correct (they're per-proxy, right?) we should see only a small sample
> of sequence numbers, and how that sample relates to overall traffic
> will vary greatly depending on the fundraising campaign and whatever
> else is going on on the site.
>
> jg
>
>> Happy to chat more here or on IRC. :)
>>
>> On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:
>>
>> Hi Nuria, thanks for raising the issue. Could you point me to the
>> script you're using for sequence checks? I'm definitely interested in
>> looking at how we might integrate that into fundraising monitoring.
>>
>> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>>
>> (cc-ing analytics public list)
>>
>> Fundraising folks:
>>
>> We were talking about the problems we have had with clickstream data
>> and Kafka as of late, and how to prevent issues like this one going
>> forward: https://phabricator.wikimedia.org/T132500
>>
>> We think you guys could benefit from setting up the same set of
>> alarms on data integrity that we have on the webrequest end, and we
>> will be happy to help with that at your convenience.
>>
>> An example of how these alarms could work (simplified version): every
>> message that comes from Kafka has a sequence ID; sorted, those
>> sequence IDs should be more or less contiguous, and a gap in sequence
>> IDs indicates data loss at the Kafka source. A script checks the
>> sequence IDs against the number of records and triggers an alarm if
>> the two do not match.
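A rough sketch of that check, run over a window of the rotated TSV files from the kafkatee pipeline described above, grouping by host the way the refinery HQL does. The column positions and the alarm threshold are placeholders, not the real layout:

    # Merge a window of rotated TSV files, then per host compare the
    # sequence-number range against the record count; a shortfall means
    # lost messages. Column positions and threshold are placeholders.
    import sys
    from collections import defaultdict

    HOST_COL, SEQ_COL = 0, 1               # placeholder TSV columns

    seqs_by_host = defaultdict(list)
    for path in sys.argv[1:]:              # e.g. several 15-minute files
        with open(path) as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                try:
                    seqs_by_host[fields[HOST_COL]].append(int(fields[SEQ_COL]))
                except (IndexError, ValueError):
                    continue               # skip malformed lines

    for host, seqs in sorted(seqs_by_host.items()):
        seqs.sort()                        # order across files is not guaranteed
        expected = seqs[-1] - seqs[0] + 1  # seqs are per-host counters
        actual = len(set(seqs))            # dedupe in case of replays
        missing = expected - actual
        pct = 100.0 * missing / expected
        if pct > 2.0:                      # placeholder alarm threshold
            print('ALARM: %s missing %d of %d records (%.2f%%)'
                  % (host, missing, expected, pct))

This only works where every per-host sequence number is expected to appear; for the filtered and sampled collectors Jeff describes, see the stride-based variant sketched after the quoted thread.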
>>
>> Let us know if you want to proceed with this work.
>>
>> Thanks,
>>
>> Nuria
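On Jeff's question about the filtered and sampled collectors: absolute contiguity never holds there, so one possibility (purely a sketch, not something the analytics alarms actually do) is to track the typical stride between consecutive observed sequence numbers per host and alarm when it jumps well above a trailing baseline:

    # Stride-based variant for a filtered or 1:10-sampled stream: per
    # host, compare the median gap between consecutive sequence numbers
    # in the current window against a recent baseline. The 2x factor
    # and baseline handling are placeholders.
    import statistics

    def median_stride(seqs):
        """Median difference between consecutive sorted sequence numbers."""
        s = sorted(set(seqs))
        if len(s) < 2:
            return None
        return statistics.median(b - a for a, b in zip(s, s[1:]))

    def check_stride(host, current_seqs, baseline_stride, factor=2.0):
        """Alarm when the stride widens well beyond the trailing baseline."""
        stride = median_stride(current_seqs)
        if stride is not None and baseline_stride is not None \
                and stride > factor * baseline_stride:
            print('ALARM: %s stride %.1f vs baseline %.1f -- possible loss'
                  % (host, stride, baseline_stride))

The baseline has to be recent (say, the previous few windows for the same host), since the stride itself moves with campaign traffic; an outage in kafkatee still shows up as a sudden widening on top of whatever the current stride is.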
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
