> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored.

Aye, Jeff mentioned maybe doing that. Not a bad idea.
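A minimal sketch of the canary idea, assuming the kafka-python client and a hypothetical broker list and canary topic (none of which is necessarily what fundraising runs): produce a uniquely-tagged message, then wait for it to come out the consumer end. As Toby notes below, the round trip doubles as a delivery-latency measurement.

    #!/usr/bin/env python
    # Canary sketch: send a uniquely-tagged message through Kafka and watch
    # for it to arrive. kafka-python client; brokers and topic hypothetical.
    import json
    import time
    import uuid

    from kafka import KafkaConsumer, KafkaProducer

    BROKERS = ['localhost:9092']    # hypothetical broker list
    TOPIC = 'webrequest_canary'     # hypothetical topic for canary traffic

    def send_and_await_canary(timeout_s=60):
        token = str(uuid.uuid4())

        # Attach the consumer first, so it is positioned at the end of the
        # topic before the canary is produced (avoids racing 'latest').
        consumer = KafkaConsumer(
            TOPIC,
            bootstrap_servers=BROKERS,
            auto_offset_reset='latest',
            consumer_timeout_ms=timeout_s * 1000,
        )
        consumer.poll(timeout_ms=1000)

        sent_at = time.time()
        producer = KafkaProducer(bootstrap_servers=BROKERS)
        producer.send(TOPIC, json.dumps({'canary': token}).encode('utf-8'))
        producer.flush()

        deadline = sent_at + timeout_s
        for message in consumer:
            try:
                record = json.loads(message.value.decode('utf-8'))
            except ValueError:
                record = {}
            if record.get('canary') == token:
                return time.time() - sent_at   # doubles as delivery latency
            if time.time() > deadline:
                break
        return None   # canary never arrived within the timeout: alarm

    if __name__ == '__main__':
        latency = send_and_await_canary()
        print('canary LOST' if latency is None
              else 'delivered in %.2fs' % latency)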
Jeff, aye, you are right. You wouldn’t be able to run the sequence number
check on your saved data. Sorry, I forgot that it wasn’t just the full
webrequest_text. You’d have to run another kafkatee output pipe then, to
check unsampled sequence numbers, similar to how the packet-loss.cpp
script worked with udp2log.

On Fri, Jul 8, 2016 at 11:05 AM, Toby Negrin <[email protected]> wrote:

> Another approach we discussed back in the day was setting up a canary
> script to send known good messages whose delivery is monitored. This
> might be a bit easier to set up.
>
> It's been effective on other systems I've worked on; it's also a good
> way to measure delivery latency.
>
> -Toby
>
> On Friday, July 8, 2016, Jeff Green <[email protected]> wrote:
>
>> On Fri, 8 Jul 2016, Andrew Otto wrote:
>>
>>> Well, you won’t be able to do it exactly how we do, since we are
>>> loading the data into Hadoop and then checking it there, so we use
>>> Hadoop tools. Here’s what we’ve got:
>>>
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql
>>> https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql
>>>
>>> This old udp2log tool did a similar thing, so it is worth knowing about:
>>> https://github.com/wikimedia/analytics-udplog/blob/master/srcmisc/packet-loss.cpp
>>> However, it only worked with TSV udp2logs, and I think it won’t work
>>> with a multi-partition Kafka topic, since seqs could be out of order
>>> based on partition read order.
>>>
>>> You guys do some kind of 15 (10?) minute roll-ups, right? You could
>>> probably make some very rough guesses about data loss in each
>>> 15-minute bucket. You’d have to be careful though, since the order of
>>> the data is not guaranteed. We have the luxury of being able to query
>>> over our hourly buckets and assuming that all (most, really) of the
>>> data belongs in that hour bucket. But we use Camus to read from Kafka,
>>> which handles the time bucket sorting for us.
>>
>> Yep, the pipeline is kafkatee->udp2log->files rotated on a 15 min
>> interval, and parser-script->mysql, which runs on a separate system.
>>
>> Since the log files are stored, one option would be a script that
>> merges several files into a longer-period sample, then sorts and checks
>> for sequence gaps. Another option would be to modify the parse-to-mysql
>> script to do the same thing.
>>
>> But the part I don't get yet is how a script looking at output logs
>> would identify a problematic gap in sequence numbers. We have two
>> collectors: one is 1:1 and the other is sampled 1:10, and both filter
>> on the GET string. So if my understanding of the sequence numbers is
>> correct (they're per-proxy, right?), we should see only a small sample
>> of sequence numbers, and how that sample relates to overall traffic
>> will vary greatly depending on the fundraising campaign and whatever
>> else is going on on the site.
>>
>> jg
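A minimal sketch of the merge-and-check option Jeff describes, with the rough-guess caveats Andrew raises: the hostname and sequence-number column positions are assumptions, and because both collectors filter on the GET string, the estimate below is a coarse signal rather than a real loss measurement.

    #!/usr/bin/env python
    # Sketch of the merge-and-check option: read several rotated TSV files,
    # group sequence numbers per host, and make a rough loss estimate.
    # Column positions and the 1:10 sample rate are assumptions.
    import sys
    from collections import defaultdict

    HOST_FIELD = 0     # assumed: hostname in the first TSV column
    SEQ_FIELD = 1      # assumed: per-host sequence number in the second
    SAMPLE_RATE = 10   # the 1:10 collector; use 1 for the unsampled one

    def estimate_loss(paths):
        seqs_by_host = defaultdict(list)
        for path in paths:
            with open(path) as f:
                for line in f:
                    fields = line.rstrip('\n').split('\t')
                    try:
                        seqs_by_host[fields[HOST_FIELD]].append(
                            int(fields[SEQ_FIELD]))
                    except (IndexError, ValueError):
                        continue   # skip malformed lines

        for host, seqs in sorted(seqs_by_host.items()):
            seqs.sort()   # order is not guaranteed across files/partitions
            expected = (seqs[-1] - seqs[0] + 1) / float(SAMPLE_RATE)
            loss = max(0.0, 1.0 - len(seqs) / expected)
            print('%s\tseen=%d\texpected~%.0f\tloss~%.1f%%'
                  % (host, len(seqs), expected, loss * 100))

    if __name__ == '__main__':
        estimate_loss(sys.argv[1:])

Setting SAMPLE_RATE = 1 reduces this to a plain contiguity check for the unsampled stream, with the same filtering caveat.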
>>> Happy to chat more here or on IRC. :)
>>>
>>> On Fri, Jul 8, 2016 at 9:17 AM, Jeff Green <[email protected]> wrote:
>>>
>>> Hi Nuria, thanks for raising the issue. Could you point me to the
>>> script you're using for sequence checks? I'm definitely interested in
>>> looking at how we might integrate that into fundraising monitoring.
>>>
>>> On Thu, 7 Jul 2016, Nuria Ruiz wrote:
>>>
>>> (cc-ing analytics public list)
>>>
>>> Fundraising folks:
>>>
>>> We were talking about the problems we have had with clickstream data
>>> and kafka as of late, and how to prevent issues like this one going
>>> forward: https://phabricator.wikimedia.org/T132500
>>>
>>> We think you guys could benefit from setting up the same set of alarms
>>> on data integrity that we have on the webrequest end, and we will be
>>> happy to help with that at your convenience.
>>>
>>> An example of how these alarms could work (simplified version): every
>>> message that comes from kafka has a sequence id; if sorted, those
>>> sequence ids should be more or less contiguous, and a gap in sequence
>>> ids indicates data loss at the kafka source. A script checks the
>>> sequence ids against the number of records and triggers an alarm if
>>> the two do not match.
>>>
>>> Let us know if you want to proceed with this work.
>>>
>>> Thanks,
>>>
>>> Nuria
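The simplified alarm Nuria describes can be sketched in a few lines; the per-host comparison of sequence-id range against record count is roughly the idea the hql jobs linked above implement. The 2% threshold and the (hostname, sequence_id) input shape are illustrative assumptions:

    # Sketch of the simplified alarm: per host, compare the sequence-id
    # range against the record count; threshold and input shape are
    # illustrative only.
    from collections import defaultdict

    def sequence_alarms(records, threshold=0.02):
        """records: iterable of (hostname, sequence_id) pairs.
        Returns {host: apparent_loss_ratio} for hosts over threshold."""
        lo, hi, count = {}, {}, defaultdict(int)
        for host, seq in records:
            lo[host] = seq if host not in lo else min(lo[host], seq)
            hi[host] = seq if host not in hi else max(hi[host], seq)
            count[host] += 1

        alarms = {}
        for host in count:
            expected = hi[host] - lo[host] + 1   # contiguous ids cover this
            loss = (expected - count[host]) / float(expected)
            if loss > threshold:
                alarms[host] = loss
        return alarms

    # e.g. sequence_alarms([('cp1001', 1), ('cp1001', 2), ('cp1001', 9)])
    #      -> {'cp1001': 0.666...} -> time to page someone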
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
