On Fri, 8 Jul 2016, Andrew Otto wrote:
Well, you won’t be able to do it exactly the way we do, since we load
the data into Hadoop and then check it there, so we use Hadoop tools.
Here’s what we’ve got:
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
This old udp2log tool did a similar thing, so it is worth knowing about:
https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
However, it only worked with TSV udp2logs, and I don’t think it will
work with a multi-partition Kafka topic, since sequence numbers can
arrive out of order depending on partition read order.
You guys do some kind of 15 (10?) minute roll ups, right? You could
probably do some very rough guesses on data loss in each 15 minute
bucket. You’d have to be careful though, since the order of the data is
not guaranteed. We have the luxury of being able to query over our
hourly buckets and assuming that all (well, most) of the data belongs
in that hour’s bucket. But we use Camus to read from Kafka, which
handles the time-bucket sorting for us.
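A minimal sketch of that rough-guess idea, assuming each record carries a timestamp, a proxy hostname, and a sequence number (the field layout here is hypothetical, not the real kafkatee format): bucket by timestamp rather than by arrival order, then compare the per-host sequence span against the record count.

```python
from collections import defaultdict

BUCKET = 15 * 60  # seconds per 15-minute roll-up window


def bucket_loss(records):
    """records: iterable of (timestamp, host, seq) tuples.

    Groups records into 15-minute buckets by timestamp (not by
    arrival order, since ordering is not guaranteed), then for each
    (bucket, host) compares the observed record count against the
    sequence-number span. The difference is a rough loss estimate;
    records straddling a bucket boundary will blur it a bit.
    """
    seqs = defaultdict(list)
    for ts, host, seq in records:
        seqs[(int(ts) // BUCKET, host)].append(seq)
    report = {}
    for key, s in seqs.items():
        expected = max(s) - min(s) + 1
        report[key] = expected - len(s)  # ~ number of missing records
    return report
```

This stays a very rough guess, exactly because sequences that belong near a bucket edge may land in the neighboring file.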
Yep, the pipeline is kafkatee->udp2log->files rotated on a 15-minute
interval, with a parser-script->mysql step that runs on a separate system.
Since the log files are kept around, one option would be a script that
merges several files into a longer-period sample, then sorts and checks
for sequence gaps. Another option would be to modify the parse-to-mysql
script to do the same thing.
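The merge-and-check script could look something like this sketch. It assumes whitespace-separated log lines with the proxy hostname and sequence number at hypothetical field positions 0 and 1, so the indices would need adjusting for the real kafkatee output format.

```python
import glob
from collections import defaultdict


def find_gaps(pattern, host_field=0, seq_field=1):
    """Merge all rotated files matching `pattern`, collect sequence
    numbers per proxy host, sort them, and report any gaps as
    (last_seen, next_seen) pairs with missing numbers in between."""
    seqs = defaultdict(set)
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) <= max(host_field, seq_field):
                    continue  # skip malformed lines
                seqs[fields[host_field]].add(int(fields[seq_field]))
    gaps = {}
    for host, s in seqs.items():
        ordered = sorted(s)
        gaps[host] = [(a, b) for a, b in zip(ordered, ordered[1:])
                      if b - a > 1]
    return gaps
```

Merging several rotated files before checking matters because a sequence that looks missing in one 15-minute file may simply have landed in the next one.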
But the part I don't get yet is how a script looking at output logs would
identify a problematic gap in sequence numbers. We have two collectors,
one is 1:1 and the other sampled 1:10, and both filter on the GET string.
So if my understanding of the sequence numbers is correct (they're
per-proxy right?) we should see only a small sample of sequence numbers,
and how that sample relates to overall traffic will vary greatly depending
on the fundraising campaign and whatever else is going on on the site.
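One way the sampled case could still work, assuming the sequence numbers really are per-proxy and the 1:10 sampling is roughly uniform over each proxy’s sequence space (the GET filtering complicates that assumption, so treat this as a sketch, not a production check): consecutive observed sequence numbers from one proxy should differ by about 10 on average, so loss shows up as the observed count falling well below span/10.

```python
def sampled_loss_estimate(seqs, sample_rate=10):
    """seqs: observed sequence numbers from one proxy, 1:N sampled.

    With 1:N sampling we expect roughly (max - min) / N records in
    the observed span, so the shortfall relative to that is a rough
    loss fraction. Too noisy for exact numbers, but a sustained large
    value over a 15-minute bucket suggests real loss rather than
    sampling jitter.
    """
    if len(seqs) < 2:
        return 0.0  # not enough data to estimate anything
    span = max(seqs) - min(seqs)
    expected = span / sample_rate
    if expected <= 0:
        return 0.0
    return max(0.0, 1.0 - (len(seqs) - 1) / expected)
```

For the unsampled 1:1 collector the same idea degenerates into the plain gap check, since sample_rate is 1 there.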
jg
Happy to chat more here or IRC. :)
On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:
Hi Nuria, thanks for raising the issue. Could you point me to the script
you're using for sequence checks? I'm definitely
interested in looking at how we might integrate that into fundraising
monitoring.
On Thu, 7 Jul 2016, Nuria Ruiz wrote:
(cc-ing analytics public list)
Fundraising folks:
We were talking about the problems we have had lately with clickstream
data and Kafka, and how to prevent issues like this one going forward:
(https://phabricator.wikimedia.org/T132500)
We think you guys could benefit from setting up the same set of
data-integrity alarms that we have on the webrequest end, and we will
be happy to help with that at your convenience.
An example of how these alarms could work (simplified version): every
message that comes from Kafka carries a sequence id. When sorted, those
sequence ids should be more or less contiguous; a gap in them indicates
data loss at the Kafka source. A script checks the sequence ids against
the number of records and triggers an alarm if the two do not match.
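In sketch form, that simplified check could look like this (the threshold and the alerting mechanism here are placeholders, not what runs in production):

```python
def check_integrity(seq_ids, threshold=0.02):
    """Compare the record count against the sequence-id span; if the
    missing fraction exceeds `threshold`, raise the alarm."""
    expected = max(seq_ids) - min(seq_ids) + 1
    missing = expected - len(set(seq_ids))
    loss = missing / expected
    if loss > threshold:
        # placeholder for real alerting (e.g. Icinga, email, IRC bot)
        print(f"ALARM: {missing} of ~{expected} records missing ({loss:.1%})")
    return loss
```

The threshold exists because small transient gaps are normal; the alarm should fire on a sustained mismatch, not on every stray sequence id.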
Let us know if you want to proceed with this work.
Thanks,
Nuria
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics