Another approach we discussed back in the day was setting up a canary
script that sends known-good messages and monitors their delivery. This
might be a bit easier to set up.

It's been effective on other systems I've worked on, and it's also a good
way to measure delivery latency.
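
Roughly what I have in mind, as a minimal sketch using kafka-python (the
topic name, broker list, and timeout are placeholders to adjust):

import json
import time
import uuid

from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

TOPIC = 'canary'             # placeholder: a topic dedicated to canary traffic
BROKERS = 'localhost:9092'   # placeholder: your broker list
TIMEOUT_S = 30               # how long before we declare the message lost

def send_and_wait():
    # Produce one marked message, wait for it to come back, return latency.
    marker = str(uuid.uuid4())
    consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                             auto_offset_reset='latest',
                             consumer_timeout_ms=TIMEOUT_S * 1000)
    consumer.poll(0)  # force partition assignment before producing
    producer = KafkaProducer(bootstrap_servers=BROKERS)
    sent_at = time.time()
    producer.send(TOPIC, json.dumps({'canary': marker}).encode('utf-8'))
    producer.flush()
    for msg in consumer:  # iterates until the marker arrives or the timeout hits
        if json.loads(msg.value).get('canary') == marker:
            return time.time() - sent_at
    return None  # not delivered within TIMEOUT_S

latency = send_and_wait()
if latency is None:
    print('CRITICAL: canary message not delivered')  # hook into alerting here
else:
    print('OK: delivered in %.3fs' % latency)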

-Toby

On Friday, July 8, 2016, Jeff Green <[email protected]> wrote:

> On Fri, 8 Jul 2016, Andrew Otto wrote:
>
>> Well, you won’t be able to do it exactly how we do, since we are loading
>> the data into Hadoop and then checking it there, so we use Hadoop tools.
>> Here’s what we’ve got:
>>
>>
>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
>>
>>
>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
>>
>> This old udp2log tool did a similar thing, so it is worth knowing about:
>> https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
>> However, it only worked with TSV udp2logs, and I think it won’t work with a
>> multi-partition Kafka topic, since sequence numbers could come back out of
>> order depending on partition read order.
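>>
>> A toy illustration of the ordering problem (Python):
>>
>> # Sequence numbers from one host, read back across two partitions.
>> # Nothing was actually lost; the read order is just interleaved.
>> seqs_as_read = [1, 4, 2, 5, 3, 6]
>> naive = sum(b - a - 1 for a, b in zip(seqs_as_read, seqs_as_read[1:]) if b > a)
>> ordered = sorted(seqs_as_read)
>> real = sum(b - a - 1 for a, b in zip(ordered, ordered[1:]) if b > a)
>> print(naive, real)  # -> 6 0: a consecutive-delta check reports phantom loss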
>>
>> You guys do some kind of 15 (10?) minute roll-ups, right?  You could
>> probably make some very rough guesses at data loss in each 15 minute bucket.
>> You’d have to be careful though, since the order of the data is not
>> guaranteed.  We have the luxury of being able to query over our hourly
>> buckets and assuming that all (most, really) of the data belongs in that
>> hour bucket.  But we use Camus to read from Kafka, which handles the time
>> bucket sorting for us.
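>>
>> In rough Python, the kind of per-bucket guess I mean, assuming each
>> record carries 'dt' (ISO timestamp), 'hostname', and 'sequence' fields;
>> records near bucket edges will land in the wrong bucket, hence "very
>> rough":
>>
>> from collections import defaultdict
>>
>> def loss_per_bucket(records, minutes=15):
>>     buckets = defaultdict(list)
>>     for r in records:
>>         mm = int(r['dt'][14:16])  # minute field of the ISO timestamp
>>         bucket = '%s:%02d' % (r['dt'][:13], mm - mm % minutes)
>>         buckets[(r['hostname'], bucket)].append(r['sequence'])
>>     for key, seqs in sorted(buckets.items()):
>>         expected = max(seqs) - min(seqs) + 1  # span of sequence numbers seen
>>         lost = expected - len(seqs)           # holes in that span
>>         yield key, 100.0 * lost / expected    # rough percent loss
>>
>> # e.g.: for (host, bucket), pct in loss_per_bucket(records): alarm if pct high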
>>
>
> Yep, the pipeline is kafkatee->udp2log->files rotated on a 15 min
> interval, and parser-script->mysql which runs on a separate system.
>
> Since the log files are stored, one option would be a script that merges
> several files into a longer-period sample, then sorts and checks for
> sequence gaps. Another option would be to modify the parse-to-mysql
> script to do the same thing.
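>
> Roughly, for the first option (Python sketch; I'm guessing that hostname
> and sequence number are the first two TSV columns, which may not match
> our log format):
>
> import glob
> import sys
> from collections import defaultdict
>
> def check(pattern):
>     seqs = defaultdict(list)
>     for path in sorted(glob.glob(pattern)):  # several 15-min files at once
>         with open(path) as f:
>             for line in f:
>                 fields = line.rstrip('\n').split('\t')
>                 seqs[fields[0]].append(int(fields[1]))  # host -> seq numbers
>     for host, vals in sorted(seqs.items()):
>         vals.sort()  # file order isn't arrival order, so sort first
>         missing = sum(b - a - 1 for a, b in zip(vals, vals[1:]) if b > a)
>         print('%s: %d records, %d missing in span' % (host, len(vals), missing))
>
> check(sys.argv[1])  # e.g. check('/path/to/logs/2016-07-08*')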
>
> But the part I don't get yet is how a script looking at output logs would
> identify a problematic gap in sequence numbers. We have two collectors: one
> is 1:1 and the other is sampled 1:10, and both filter on the GET string. So
> if my understanding of the sequence numbers is correct (they're per-proxy,
> right?) we should see only a small sample of sequence numbers, and how that
> sample relates to overall traffic will vary greatly depending on the
> fundraising campaign and whatever else is going on on the site.
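>
> For a clean, unfiltered 1:10 sample I could imagine a statistical check,
> comparing the observed record count against the sequence-number span
> divided by the sampling rate (sketch below; the 20% tolerance is an
> arbitrary knob). But since the GET filter runs first, the effective rate
> per proxy is unknown and shifting, which is exactly the problem:
>
> def sampled_loss_suspected(seqs, rate=10, tolerance=0.2):
>     seqs = sorted(seqs)            # per-proxy sequence numbers in one window
>     span = seqs[-1] - seqs[0] + 1  # messages the proxy emitted in the window
>     expected = span / float(rate)  # how many a 1:10 sample should catch
>     return len(seqs) < (1 - tolerance) * expected  # True -> possible loss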
>
> jg
>
>
>> Happy to chat more here or IRC. :)
>>
>> On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:
>>
>>> Hi Nuria, thanks for raising the issue. Could you point me to the
>>> script you're using for sequence checks? I'm definitely interested in
>>> looking at how we might integrate that into fundraising monitoring.
>>>
>>> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>>>
>>>> (cc-ing analytics public list)
>>>> Fundraising folks:
>>>>
>>>> We were talking about the problems we have had with clickstream data
>>>> and Kafka as of late, and how to prevent issues like this one going
>>>> forward (https://phabricator.wikimedia.org/T132500).
>>>>
>>>> We think you guys could benefit from setting up the same set of alarms
>>>> on data integrity that we have on the webrequest end, and we will be
>>>> happy to help with that at your convenience.
>>>>
>>>> An example of how these alarms could work (simplified version): every
>>>> message that comes from Kafka has a sequence id. If sorted, those
>>>> sequence ids should be more or less contiguous; a gap in sequence ids
>>>> indicates data loss at the Kafka source. A script checks the sequence
>>>> ids against the number of records and triggers an alarm if the two do
>>>> not match.
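>>>>
>>>> In pseudo-Python, the simplified check per host and time bucket is
>>>> just (names illustrative):
>>>>
>>>> def check_bucket(seq_ids):
>>>>     seq_ids = sorted(seq_ids)                # ids from one host, one bucket
>>>>     expected = seq_ids[-1] - seq_ids[0] + 1  # ids that should be present
>>>>     lost = expected - len(seq_ids)           # a gap means loss at the source
>>>>     return lost  # alarm when lost (or lost/expected) crosses a threshold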
>>>>
>>>> Let us know if you want to proceed with this work.
>>>>
>>>> Thanks,
>>>>
>>>> Nuria