On 10 June 2015 at 10:53, Dan Andreescu <[email protected]> wrote:
> I see three ways for data to get into the cluster:
>
> 1. request stream, handled already, we're working on ways to pump the data
> back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How... do we
kick that off most easily?

> 2. Event Logging. We're making this scale arbitrarily by moving it to
> Kafka. Once that's done, we should be able to instrument pretty much
> anything with Event Logging
>
> 3. Piwik. There is a small but growing effort to stand up our own piwik
> instance so we can get basic canned reports out of the box and not have to
> reinvent the wheel for every single feature we're trying to instrument and
> learn about. This could replace a lot of the use cases for Event Logging
> and free up Event Logging to do more free-form research rather than cookie
> cutter web analytics.
>
> Answers inline:
>
>> So I'm asking, I guess, two things. The first is: can we have a firm
>> commitment that we'll get this kind of stuff into Hadoop? Right now we
>> have a RESTful API everywhere that is not (to my knowledge) throwing
>> data into the request logs. We have a WDQS that isn't either.
>> Undoubtedly we have other tools I haven't encountered. It's paramount
>> that the first question we ask with new services or systems is "so
>> when does new traffic data start hitting the analytics cluster?"
>
> The commitment has to be made on both sides. The teams building the
> services have to instrument them, picking either 2 or 3 above. And then
> we'll commit to supporting the path they choose. The piwik path may be
> slow right now, fair warning.
>
>> Second: what's best practices for this? What resources are available?
>> If I'm starting a service on Labs that provides data to third-parties,
>
> What exactly do you mean here? That's a loaded term and possibly against
> the labs privacy policy depending on what you mean.

An API, Dan ;)

>> what would analytics recommend my easiest path is to getting request
>> logs into Hadoop?
>
> Weighing everything on balance, right now I'd say adding your name to the
> piwik supporters.
> So far, off the top of my head, that list is:
>
> * wikimedia store
> * annual report
> * the entire reading vertical
> * russian wikimedia chapter (most likely all other chapters would chime in
> supporting it)
> * a bunch of labs projects (including wikimetrics, vital signs, various
> dashboards, etc.)

How is piwik linked to Hadoop? I'm not asking "how do we visualise the
data"; I'm asking how we get it into the cluster in the first place.

> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
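[Editorial aside: for readers unfamiliar with what "instrumenting with Event Logging" (option 2 above) means concretely, here is a minimal, hypothetical sketch of an event capsule and the validation a Kafka-side consumer might do before batching events toward the cluster. The field names, schema name, and values below are illustrative assumptions, not the actual EventLogging schema.]

```python
import json

# Hypothetical EventLogging-style capsule: a JSON envelope wrapping a
# schema name, a schema revision, a timestamp, and the event payload.
REQUIRED_FIELDS = {"schema", "revision", "timestamp", "event"}

def validate_capsule(raw: str) -> dict:
    """Parse a JSON event capsule and check required envelope fields.

    Raises ValueError if the envelope is incomplete; otherwise returns
    the parsed dict, ready to append to a Hadoop-bound batch.
    """
    capsule = json.loads(raw)
    missing = REQUIRED_FIELDS - capsule.keys()
    if missing:
        raise ValueError(f"capsule missing fields: {sorted(missing)}")
    return capsule

# Example: a made-up search-click event flowing through the pipeline.
raw_event = json.dumps({
    "schema": "TestSearchClick",   # assumed schema name
    "revision": 1,
    "timestamp": 1433933580,
    "event": {"query": "piwik", "position": 3},
})
capsule = validate_capsule(raw_event)
print(capsule["event"]["query"])  # -> piwik
```

The point of the envelope/payload split is that the consumer can route and validate events generically, regardless of which service produced them.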
