On 10 June 2015 at 12:00, Andrew Otto <[email protected]> wrote:
> Hmmm.
>
> There's no reason we couldn't maintain beta-level Kafka + Hadoop clusters
> in labs. We probably should! I don't really want to maintain them myself,
> but they should be pretty easy to set up using hiera now. I could maintain
> them if no one else wants to.
>
> Thought two:
>
> "so when does new traffic data start hitting the analytics cluster?"
>
> If it is HTTP requests from varnish you are looking for, this will for the
> most part just happen, unless the varnish cluster serving the requests is
> different than the usual webrequest_sources you are used to seeing. I'm
> not sure which varnishes RESTBase HTTP is using, but if they aren't using
> one of the usual ones we are already importing into HDFS, it would be
> trivial to set this up.
>
> If I'm starting a service on Labs that provides data to third parties,
> what would analytics recommend my easiest path is to getting request
> logs into Hadoop?
>
> We can't do this directly into the production Analytics Cluster, since
> labs is firewalled off from production networks. However, a service like
> this would be intended to move to production eventually, yes? If so, then
> perhaps a beta Analytics Cluster would allow you to develop the methods
> needed to get data into Hadoop in Labs. Then the move into production
> would be simpler and already have Analytics Cluster support.
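[Editor's illustration: the beta Kafka cluster idea above implies a Labs service producing its own request logs into a Kafka topic. A minimal sketch, assuming the kafka-python client; the broker address, topic name, and record fields are all invented here, not taken from the thread:]

```python
import json
from datetime import datetime, timezone


def request_log_record(uri, status, user_agent):
    """Build a JSON-serialisable request-log record.

    Field names are illustrative only, not a real webrequest schema.
    """
    return {
        "dt": datetime.now(timezone.utc).isoformat(),
        "uri": uri,
        "status": status,
        "user_agent": user_agent,
    }


def send_to_beta_kafka(record, topic="beta.webrequest.mylabsservice"):
    """Produce one record to a hypothetical beta-cluster Kafka broker."""
    # Deferred import: requires the kafka-python package.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.beta.wmflabs.org:9092",  # hypothetical host
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(topic, record)
    producer.flush()
```

[Once records land in a topic like this, the usual Kafka-to-HDFS import the thread mentions could pick them up; the point of the sketch is only the shape of the producer side.]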
That sounds better than nothing; not perfect, but totally understandable.
The impression I'm really getting is "stuff should get off Labs ASAP".

> 2. Event Logging. We're making this scale arbitrarily by moving it to
> Kafka. Once that's done, we should be able to instrument pretty much
> anything with Event Logging
>
> Dan, I'd like to not promise anything here at the moment. I think this
> effort will significantly increase our throughput, but I'm not willing to
> claim arbitrary scale. Unless we figure out a way to farm out and
> parallelize eventlogging processors in an easy way, scaling eventlogging
> even with Kafka to big-data sizes will be cumbersome and manual.
>
> Eventually I'd like to have a system that is bound by hardware and not
> architecture, but that is not well defined and still a long way off. We
> will see.
>
> But, Dan is right, eventlogging might be a good way to get labs data into
> the production Analytics Cluster, since any client can log via HTTP POSTs.
> We aren't currently importing eventlogging data into the Analytics
> Cluster, but one of the points of the almost-finished eventlogging-kafka
> is to get this data into Hadoop, so that should happen soon.
>
> The commitment has to be made on both sides. The teams building the
> services have to instrument them,
>
> Agree. If you want HTTP requests to your services and those HTTP requests
> go through varnish, this will be very easy. If you want anything beyond
> that, the service developers will have to implement it.
>
>
> On Jun 10, 2015, at 08:35, Dan Andreescu <[email protected]> wrote:
>
> On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes <[email protected]> wrote:
>>
>> On 10 June 2015 at 10:53, Dan Andreescu <[email protected]> wrote:
>> > I see three ways for data to get into the cluster:
>> >
>> > 1. request stream, handled already, we're working on ways to pump the
>> > data back out through APIs
>>
>> Awesome, and it'd end up in the Hadoop cluster in a table?
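[Editor's illustration: "any client can log via HTTP POSTs" suggests something like the following. The endpoint URL, capsule field names, and schema name are all hypothetical, loosely modelled on EventLogging; nothing in the thread confirms them:]

```python
import json
from urllib import request


def make_capsule(schema, revision, event):
    """Wrap an event dict in an EventLogging-style capsule.

    Field names are illustrative; treat the real capsule format as TBD.
    """
    return {"schema": schema, "revision": revision, "event": event}


def post_event(capsule, url="https://eventlogging.beta.wmflabs.org/event"):
    """POST one JSON capsule to a hypothetical beta EventLogging endpoint."""
    req = request.Request(
        url,
        data=json.dumps(capsule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status
```

[The attraction of this path, as Andrew notes, is that it works from anywhere that can reach the endpoint over HTTP, Labs included, and the Kafka-backed pipeline then takes care of landing the data in Hadoop.]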
>> How...do we kick that off most easily?
>
> Nono, I mean our specific web request stream. I don't think there's any
> way to piggyback onto that for arbitrary other services. This is not an
> option for you, it's just a way that data gets into the cluster, for
> completeness.
>
>> Second: what's best practices for this? What resources are available?
>>
>> If I'm starting a service on Labs that provides data to third parties,
>> >
>> > What exactly do you mean here? That's a loaded term and possibly
>> > against the labs privacy policy depending on what you mean.
>>
>> An API, Dan ;)
>
> Ok, so ... usage of the API is what you're after; I think piwik is
> probably the best solution.
>
>> >> what would analytics recommend my easiest path is to getting request
>> >> logs into Hadoop?
>> >
>> > Weighing everything on balance, right now I'd say adding your name to
>> > the piwik supporters. So far, off the top of my head, that list is:
>> >
>> > * wikimedia store
>> > * annual report
>> > * the entire reading vertical
>> > * russian wikimedia chapter (most likely all other chapters would
>> >   chime in supporting it)
>> > * a bunch of labs projects (including wikimetrics, vital signs,
>> >   various dashboards, etc.)
>>
>> How is piwik linked to Hadoop? I'm not asking "how do we visualise the
>> data", I'm asking how we get it into the cluster in the first place.
>
> I think for the most part, piwik would handle reporting and crunching
> numbers for you and get you some basic reports. But if we wanted to
> crunch tons of data, we could integrate it with hadoop somehow.
>
> I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS, it did
> not happen).
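[Editor's illustration: the piwik option above boils down to firing tracking requests at a Piwik instance's HTTP Tracking API. A sketch of building such a request; the base URL and site id are hypothetical, while `idsite`, `rec`, `url`, and `action_name` are standard tracking parameters:]

```python
from urllib.parse import urlencode


def piwik_tracking_url(base, idsite, page_url, action_name):
    """Build a Piwik HTTP Tracking API request URL.

    `base` is the piwik.php endpoint of a Piwik install (hypothetical here).
    """
    params = {
        "idsite": idsite,       # numeric site id registered in Piwik
        "rec": 1,               # required: marks this as a tracking request
        "url": page_url,        # the URL being tracked
        "action_name": action_name,
    }
    return base + "?" + urlencode(params)
```

[A service would GET this URL (or POST the same parameters) on each request it wants counted; Piwik then does the reporting, which is exactly the "crunching numbers for you" role Dan describes, with any Hadoop integration left as a separate step.]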
--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
