Well, we consume our Kafka streams into HDFS and check the sequence numbers
with Hive through Oozie.  The jobs and scripts are here:

https://github.com/wikimedia/analytics-refinery/tree/master/oozie/webrequest/load

So it's a bit more complicated and not directly useful to your data flow
(Kafkatee -> MySQL, right?).  But we'd love to help you get familiar with
the code and approach.  This script computes the stats and puts them in
wmf.webrequest_sequence_stats:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
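For a rough idea of the kind of stats involved, here is a minimal Python
sketch.  It is not the HQL linked above: it assumes each record carries a
hostname and a per-host sequence number, and the function name and the
field names in the result are made up for illustration.

```python
from collections import defaultdict


def sequence_stats(records):
    """Summarize sequence numbers per host.

    `records` is an iterable of (hostname, sequence) pairs. Both names are
    illustrative assumptions, not taken from the refinery schema.
    """
    by_host = defaultdict(list)
    for hostname, sequence in records:
        by_host[hostname].append(sequence)

    stats = {}
    for hostname, seqs in by_host.items():
        # If nothing was lost, the sequence numbers fill the whole range.
        expected = max(seqs) - min(seqs) + 1
        distinct = len(set(seqs))
        stats[hostname] = {
            "sequence_min": min(seqs),
            "sequence_max": max(seqs),
            "count_actual": len(seqs),
            "count_expected": expected,
            "count_missing": expected - distinct,     # gaps in the range
            "count_duplicate": len(seqs) - distinct,  # repeated sequence numbers
        }
    return stats
```

A host whose count_missing is non-zero has gaps in its sequence, which is
the signal for possible data loss.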

This is then aggregated hourly, and checked by this workflow, which sends
emails if it sees problems:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/check_sequence_statistics_workflow.xml
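Conceptually, a check like this compares the loss and duplicate ratios for
an hour against some thresholds.  A minimal Python sketch follows; the
function name, parameter names and threshold values are hypothetical and
are not the ones used in the workflow linked above.

```python
def check_hour(count_expected, count_missing, count_duplicate,
               max_loss_pct=1.0, max_duplicate_pct=1.0):
    """Return a list of problems for one hour of data (empty if it looks fine).

    The default thresholds are placeholders chosen for illustration only.
    """
    if count_expected <= 0:
        return ["no data for this hour"]

    problems = []
    loss_pct = 100.0 * count_missing / count_expected
    duplicate_pct = 100.0 * count_duplicate / count_expected
    if loss_pct > max_loss_pct:
        problems.append(f"loss {loss_pct:.2f}% exceeds {max_loss_pct:.2f}%")
    if duplicate_pct > max_duplicate_pct:
        problems.append(
            f"duplicates {duplicate_pct:.2f}% exceed {max_duplicate_pct:.2f}%")
    return problems
```

A non-empty result is what would trigger an email in a setup like this,
and could also be used to hold back downstream jobs for that hour.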

We can then use information about data quality for each hour to re-run
jobs, postpone jobs that would compute bad data, and so on.  We do some of
that, but we've changed it a bit over the years, so if you'd like more
detail you can grab someone like Joseph for a quick meeting.

On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:

> Hi Nuria, thanks for raising the issue. Could you point me to the script
> you're using for sequence checks? I'm definitely interested in looking at
> how we might integrate that into fundraising monitoring.
>
>
> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>
>> (cc-ing analytics public list)
>> Fundraising folks:
>>
>> We were talking about the problems we have had with clickstream data and
>> Kafka as of late and how to prevent issues like this one going forward:
>> (https://phabricator.wikimedia.org/T132500)
>>
>> We think you guys could benefit from setting up the same set of alarms on
>> data integrity that we have on the webrequest end and we will be happy
>> to help with that at your convenience.
>>
>> An example of how these alarms could work (simplified version): every
>> message that comes from Kafka has a sequence ID. When sorted, those
>> sequence IDs should be more or less contiguous; a gap in sequence IDs
>> indicates data loss at the Kafka source. A script compares the range of
>> sequence IDs with the number of records and triggers an alarm if the two
>> do not match.
>>
>> Let us know if you want to proceed with this work.
>>
>> Thanks,
>>
>> Nuria
>>
>>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
