On 10 June 2015 at 10:53, Dan Andreescu <[email protected]> wrote:
> I see three ways for data to get into the cluster:
>
> 1. request stream, handled already, we're working on ways to pump the data
> back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How... do we
kick that off most easily?

> 2. Event Logging. We're making this scale arbitrarily by moving it to
> Kafka. Once that's done, we should be able to instrument pretty much
> anything with Event Logging
>
> 3. Piwik. There is a small but growing effort to stand up our own piwik
> instance so we can get basic canned reports out of the box and not have to
> reinvent the wheel for every single feature we're trying to instrument and
> learn about. This could replace a lot of the use cases for Event Logging
> and free up Event Logging to do more free-form research rather than cookie
> cutter web analytics.
>
> Answers inline:
>
>> So I'm asking, I guess, two things. The first is: can we have a firm
>> commitment that we'll get this kind of stuff into Hadoop? Right now we
>> have a RESTful API everywhere that is not (to my knowledge) throwing
>> data into the request logs. We have a WDQS that isn't either.
>> Undoubtedly we have other tools I haven't encountered. It's paramount
>> that the first question we ask with new services or systems is "so
>> when does new traffic data start hitting the analytics cluster?"
>
> The commitment has to be made on both sides. The teams building the
> services have to instrument them, picking either 2 or 3 above. And then
> we'll commit to supporting the path they choose. The piwik path may be
> slow right now, fair warning.
>
>> Second: what's best practices for this? What resources are available?
>> If I'm starting a service on Labs that provides data to third-parties,
>
> What exactly do you mean here? That's a loaded term and possibly against
> the labs privacy policy depending on what you mean.

An API, Dan ;)

>> what would analytics recommend my easiest path is to getting request
>> logs into Hadoop?
>
> Weighing everything on balance, right now I'd say adding your name to the
> piwik supporters.
> So far, off the top of my head, that list is:
>
> * wikimedia store
> * annual report
> * the entire reading vertical
> * russian wikimedia chapter (most likely all other chapters would chime in
> supporting it)
> * a bunch of labs projects (including wikimetrics, vital signs, various
> dashboards, etc.)

How is piwik linked to Hadoop? I'm not asking "how do we visualise the
data"; I'm asking how we get it into the cluster in the first place.

> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
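[Editorial aside: for readers unfamiliar with what "instrumenting with Event Logging" (option 2 above) means concretely, here is a minimal, hypothetical sketch of an event capsule and the validation a Kafka-side consumer might do before batching events toward the cluster. The field names, schema name, and values below are illustrative assumptions, not the actual EventLogging schema.]

```python
import json

# Hypothetical EventLogging-style capsule: a JSON envelope wrapping a
# schema name, a schema revision, a timestamp, and the event payload.
REQUIRED_FIELDS = {"schema", "revision", "timestamp", "event"}

def validate_capsule(raw: str) -> dict:
    """Parse a JSON event capsule and check required envelope fields.

    Raises ValueError if the envelope is incomplete; otherwise returns
    the parsed dict, ready to append to a Hadoop-bound batch.
    """
    capsule = json.loads(raw)
    missing = REQUIRED_FIELDS - capsule.keys()
    if missing:
        raise ValueError(f"capsule missing fields: {sorted(missing)}")
    return capsule

# Example: a made-up search-click event flowing through the pipeline.
raw_event = json.dumps({
    "schema": "TestSearchClick",   # assumed schema name
    "revision": 1,
    "timestamp": 1433933580,
    "event": {"query": "piwik", "position": 3},
})
capsule = validate_capsule(raw_event)
print(capsule["event"]["query"])  # -> piwik
```

The point of the envelope/payload split is that the consumer can route and validate events generically, regardless of which service produced them.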
