It sounds like a canary/heartbeat approach is the best fit for the fundraising
scenario; we'll put that in the hopper. Thanks for all your feedback, everyone!

jg

On Fri, 8 Jul 2016, Andrew Otto wrote:

> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored.

Aye, Jeff mentioned maybe doing that.  Not a bad idea.

Jeff, aye, you are right.  You wouldn’t be able to run the sequence number
check on your saved data.  Sorry, I forgot that it wasn’t just the full
webrequest_text.  You’d have to run another kafkatee output pipe then, to
check unsampled sequence numbers, similar to how the packet-loss.cpp script
worked with udp2log.
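
Something like this could sit on the end of that pipe (an untested sketch;
the TSV column positions for hostname and sequence number are assumptions
to adjust to the kafkatee output format):

import sys

# Track the last sequence number seen per source host; kafkatee writes
# one TSV record per line to this script's stdin.
last_seen = {}

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    try:
        host, seq = fields[0], int(fields[1])
    except (IndexError, ValueError):
        continue  # skip malformed lines
    prev = last_seen.get(host)
    if prev is not None and seq > prev + 1:
        # A jump of more than 1 means lost (or reordered) messages.
        print('%s: %d missing between %d and %d'
              % (host, seq - prev - 1, prev, seq))
    last_seen[host] = seq

(As noted below, this assumes per-host ordering, which a multi-partition
topic won’t give you.)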



On Fri, Jul 8, 2016 at 11:05 AM, Toby Negrin <[email protected]> wrote:
      Another approach we discussed back in the day was setting up a canary
      script to send known good messages whose delivery is monitored. This
      might be a bit easier to set up. It's been effective on other systems
      I've worked on; also a good way to measure delivery latency.
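
      Roughly, with the kafka-python client (untested sketch; the broker
      list and topic name are placeholders, not what we actually run):

      import json, time
      from kafka import KafkaProducer, KafkaConsumer

      BROKERS = 'localhost:9092'   # placeholder broker list
      TOPIC = 'canary'             # placeholder topic name

      # Produce one known-good message carrying its send time.
      producer = KafkaProducer(bootstrap_servers=BROKERS)
      sent_at = time.time()
      producer.send(TOPIC, json.dumps({'sent_at': sent_at}).encode('utf-8'))
      producer.flush()

      # Look for it on the consumer side; no match within 30s -> alarm.
      # Reads the (low-volume) canary topic from the start for simplicity.
      consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                               auto_offset_reset='earliest',
                               consumer_timeout_ms=30000)
      for msg in consumer:
          if json.loads(msg.value.decode('utf-8')).get('sent_at') == sent_at:
              print('delivered, latency %.3fs' % (time.time() - sent_at))
              break
      else:
          raise SystemExit('ALARM: canary not delivered within 30s')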

-Toby

On Friday, July 8, 2016, Jeff Green <[email protected]> wrote:
      On Fri, 8 Jul 2016, Andrew Otto wrote:

            Well, you won’t be able to do it exactly how we do, since we
            are loading the data into Hadoop and then checking it there, so
            we use Hadoop tools.  Here’s what we got:

            https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
            https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql

            This old udp2log tool did a similar thing, so it is worth
            knowing about:
            https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
            However, it only worked with TSV udp2logs, and I think it won’t
            work with a multi-partition kafka topic, since seqs could be
            out of order based on partition read order.

            You guys do some kind of 15 (10?) minute roll ups, right?  You
            could probably do some very rough guesses on data loss in each
            15 minute bucket.  You’d have to be careful though, since the
            order of the data is not guaranteed.  We have the luxury of
            being able to query over our hourly buckets and assuming that
            all (most, really) of the data belongs in that hour bucket.
            But, we use Camus to read from Kafka, which handles the time
            bucket sorting for us.
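
            For one 15-minute file the rough guess could be as simple as
            this (sketch; assumes host in TSV column 0 and sequence number
            in column 1):

            import sys
            from collections import defaultdict

            # Per-host (min_seq, max_seq, count) over one 15-minute file.
            # Using min/max instead of sorting tolerates out-of-order rows,
            # but records that really belong to a neighboring bucket will
            # skew the estimate.
            stats = defaultdict(lambda: [None, None, 0])

            with open(sys.argv[1]) as f:
                for line in f:
                    fields = line.rstrip('\n').split('\t')
                    try:
                        host, seq = fields[0], int(fields[1])
                    except (IndexError, ValueError):
                        continue
                    s = stats[host]
                    s[0] = seq if s[0] is None else min(s[0], seq)
                    s[1] = seq if s[1] is None else max(s[1], seq)
                    s[2] += 1

            for host, (lo, hi, n) in sorted(stats.items()):
                expected = hi - lo + 1
                print('%s: %d/%d records, ~%.2f%% loss'
                      % (host, n, expected,
                         100.0 * (expected - n) / expected))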


      Yep, the pipeline is kafkatee->udp2log->files rotated on a 15 min
      interval, and parser-script->mysql, which runs on a separate system.

      Since the log files are stored, one option would be to have a script
      that merges several files into a longer sample period, sorts them,
      and checks for sequence gaps (sketched below). Another option would
      be to modify the parse-to-mysql script to do the same thing.
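
      A sketch of the first option (untested; the file glob and TSV column
      layout are guesses at our format):

      import glob, sys
      from collections import defaultdict

      # Merge several rotated 15-minute files, then sort per host and
      # report holes in the sequence.
      seqs = defaultdict(list)

      for path in sorted(glob.glob(sys.argv[1])):  # e.g. 'log-2016070811*'
          with open(path) as f:
              for line in f:
                  fields = line.rstrip('\n').split('\t')
                  try:
                      seqs[fields[0]].append(int(fields[1]))
                  except (IndexError, ValueError):
                      continue

      for host, nums in sorted(seqs.items()):
          nums.sort()
          for a, b in zip(nums, nums[1:]):
              if b > a + 1:
                  print('%s: gap of %d (%d -> %d)' % (host, b - a - 1, a, b))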

      But the part I don't get yet is how a script looking at output logs
      would identify a problematic gap in sequence numbers. We have two
      collectors, one is 1:1 and the other sampled 1:10, and both filter on
      the GET string. So if my understanding of the sequence numbers is
      correct (they're per-proxy, right?) we should see only a small sample
      of sequence numbers, and how that sample relates to overall traffic
      will vary greatly depending on fundraising campaign and what else is
      going on on the site.

      jg


            Happy to chat more here or IRC. :)

            On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:

                  Hi Nuria, thanks for raising the issue. Could you point me
                  to the script you're using for sequence checks? I'm
                  definitely interested in looking at how we might integrate
                  that into fundraising monitoring.

                  On Thu, 7 Jul 2016, Nuria Ruiz wrote:

                        (cc-ing analytics public list)

                        Fundraising folks:

                        We were talking about the problems we have had with
                        clickstream data and kafka as of late, and how to
                        prevent issues like this one going forward:
                        (https://phabricator.wikimedia.org/T132500)

                        We think you guys could benefit from setting up the
                        same set of alarms on data integrity that we have
                        on the webrequest end, and we will be happy to help
                        with that at your convenience.

                        An example of how these alarms could work
                        (simplified version): every message that comes from
                        kafka has a sequence id. If sorted, those sequence
                        ids should be more or less contiguous; a gap in
                        sequence ids indicates an issue with data loss at
                        the kafka source. A script checks the sequence ids
                        against the number of records and triggers an alarm
                        if those two do not match.
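
                        In python, the core of that check is just a few
                        lines (sketch; the alarm here is a plain print):

                        def check(seq_ids):
                            # seq_ids: sequence ids seen for one host in
                            # one time bucket. If ids are contiguous, the
                            # count matches the id range; a shortfall
                            # means lost records.
                            expected = max(seq_ids) - min(seq_ids) + 1
                            missing = expected - len(seq_ids)
                            if missing:
                                print('ALARM: %d of %d records missing'
                                      % (missing, expected))

                        check([1001, 1002, 1004, 1005])  # ALARM: 1 of 5 ...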

                        Let us know if you want to proceed with this work.

                        Thanks,

                        Nuria







_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
