Well, we consume our Kafka streams into HDFS and check the sequence numbers with Hive through Oozie. The jobs and scripts are here:

https://github.com/wikimedia/analytics-refinery/tree/master/oozie/webrequest/load

So it's a bit more complicated and not directly useful to your data flow (Kafkatee -> MySQL, right?). But we'd love to help you get familiar with the code and approach.

This script computes the stats and puts them in wmf.webrequest_sequence_stats:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql

This is then aggregated hourly, and checked by this workflow, which sends emails if it sees problems:

https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/check_sequence_statistics_workflow.xml

We can then use information about data quality for each hour to re-run jobs, postpone jobs that would compute bad data, and so on. We do some of that, but we've changed it a bit over the years, so if you'd like more detail you can grab someone like Joseph and have a quick meeting.

On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:

> Hi Nuria, thanks for raising the issue. Could you point me to the script
> you're using for sequence checks? I'm definitely interested in looking at
> how we might integrate that into fundraising monitoring.
>
> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>
>> (cc-ing analytics public list)
>> Fundraising folks:
>>
>> We were talking about the problems we have had with clickstream data and
>> Kafka as of late and how to prevent issues like this one going forward:
>> (https://phabricator.wikimedia.org/T132500)
>>
>> We think you guys could benefit from setting up the same set of alarms on
>> data integrity that we have on the webrequest end, and we will be happy
>> to help with that at your convenience.
>>
>> An example of how these alarms could work (simplified version): every
>> message that comes from Kafka has a sequence ID. If sorted, those sequence
>> IDs should be more or less contiguous; a gap in sequence IDs indicates an
>> issue with data loss at the Kafka source. A script checks for sequence
>> IDs and number of records and triggers an alarm if those two do not match.
>>
>> Let us know if you want to proceed with this work.
>>
>> Thanks,
>>
>> Nuria
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
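To make the simplified check quoted above concrete, here is a minimal Python sketch. It is not the refinery code (that lives in the HQL and Oozie files linked above); it only assumes each record carries a producing host and an integer sequence ID, and that sequence IDs are contiguous per host within the window being checked. The names SequenceStats, sequence_stats, hosts_to_alarm and max_loss_ratio are hypothetical, made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SequenceStats:
    """Per-host summary of the sequence IDs seen in one time window."""
    host: str
    min_seq: int
    max_seq: int
    count_actual: int    # records received, duplicates included
    count_distinct: int  # distinct sequence IDs received

    @property
    def count_expected(self) -> int:
        # With contiguous IDs, max - min + 1 records should have arrived.
        return self.max_seq - self.min_seq + 1

    @property
    def missing(self) -> int:
        # IDs inside [min_seq, max_seq] that never showed up (possible loss).
        return self.count_expected - self.count_distinct

    @property
    def duplicates(self) -> int:
        # Records whose sequence ID was already seen.
        return self.count_actual - self.count_distinct


def sequence_stats(records: Iterable[Tuple[str, int]]) -> List[SequenceStats]:
    """Group (host, sequence_id) pairs by host and summarise each group."""
    by_host = defaultdict(list)
    for host, seq in records:
        by_host[host].append(seq)
    return [
        SequenceStats(host, min(seqs), max(seqs), len(seqs), len(set(seqs)))
        for host, seqs in sorted(by_host.items())
    ]


def hosts_to_alarm(stats: Iterable[SequenceStats],
                   max_loss_ratio: float = 0.0) -> List[str]:
    """Return hosts with duplicates, or with loss above the allowed ratio."""
    return [
        s.host
        for s in stats
        if s.duplicates > 0 or s.missing > max_loss_ratio * s.count_expected
    ]


if __name__ == "__main__":
    window = [("hostA", 1), ("hostA", 2), ("hostA", 4),     # ID 3 is missing
              ("hostB", 10), ("hostB", 11), ("hostB", 11),  # ID 11 duplicated
              ("hostC", 5), ("hostC", 6), ("hostC", 7)]     # clean
    for host in hosts_to_alarm(sequence_stats(window)):
        print("sequence check failed for", host)
```

Note that this sketch would raise false alarms if a host restarts and its sequence counter resets inside the window, so a real check would need to handle that case; it also cannot see records lost before the first or after the last ID in the window.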
