At the moment I don't have specific questions because we're trying to just get the thing set up. But, wider context and a prediction:
The budget this year has ensured, at least for Discovery, that ops and hardware support are slashed to the bone. Because of this we're deploying bigger and bigger things on Labs - I wouldn't describe Wikidata Query Service as a "small and fairly limited virtual machine" - because there we actually have machines (sure, virtual ones, but machines. There is hardware). This isn't going to stop until people have the resourcing to throw them onto production.

So from where I'm sitting it looks like the options are "no integration around analytics" or "stop building anything until you have machines in prod for it". I don't want to give AnEng a ton of work, but neither of these options seems particularly appealing, particularly since I have a mandate to /get/ analytics for the things we're building. And not having a cluster on labs, or cluster access from labs, doesn't remove the headache; it just shifts it downstream, because now every analyst generating metrics from these services has to integrate an entirely new set of things into their workflows.

On 10 June 2015 at 12:15, Dan Andreescu <[email protected]> wrote:
> I think this thread is a bit too vague. If piwik is woefully inadequate,
> then what kind of analysis is needed for the use cases you're talking about?
> It doesn't seem obvious that we need endlessly scalable systems like Hadoop
> to analyze data gathered by small and fairly limited virtual machines.
>
> I agree with Andrew's Beta Analytics cluster idea, but I think we need to
> get specific here in order to come up with a good first step.
>
> On Wed, Jun 10, 2015 at 12:09 PM, Oliver Keyes <[email protected]> wrote:
>>
>> On 10 June 2015 at 12:00, Andrew Otto <[email protected]> wrote:
>> > HmMmm.
>> >
>> > There's no reason we couldn't maintain beta-level Kafka + Hadoop
>> > clusters in labs. We probably should! I don't really want to maintain
>> > them myself, but they should be pretty easy to set up using hiera now.
>> > I could maintain them if no one else wants to.
>> >
>> > Thought two:
>> >
>> > "so when does new traffic data start hitting the analytics cluster?"
>> >
>> > If it is HTTP requests from varnish you are looking for, this will for
>> > the most part just happen, unless the varnish cluster serving the
>> > requests is different from the usual webrequest_sources you are used
>> > to seeing. I'm not sure which varnishes RESTbase HTTP is using, but if
>> > they aren't using one of the usual ones we are already importing into
>> > HDFS, it would be trivial to set this up.
>> >
>> > If I'm starting a service on Labs that provides data to third parties,
>> > what would analytics recommend as my easiest path to getting request
>> > logs into Hadoop?
>> >
>> > We can't do this directly into the production Analytics Cluster, since
>> > labs is firewalled off from production networks. However, a service
>> > like this would be intended to move to production eventually, yes? If
>> > so, then perhaps a beta Analytics Cluster would allow you to develop
>> > the methods needed to get data into Hadoop in Labs. Then the move into
>> > production would be simpler and already have Analytics Cluster support.
>>
>> That sounds better than nothing; not perfect, but totally
>> understandable. The impression I'm really getting is "stuff should get
>> off Labs ASAP".
>>
>> > 2. Event Logging. We're making this scale arbitrarily by moving it to
>> > Kafka. Once that's done, we should be able to instrument pretty much
>> > anything with Event Logging.
>> >
>> > Dan, I'd like to not promise anything here at the moment. I think this
>> > effort will significantly increase our throughput, but I'm not willing
>> > to claim arbitrary scale. Unless we figure out a way to farm out and
>> > parallelize eventlogging processors in an easy way, scaling
>> > eventlogging even with Kafka to big-data sizes will be cumbersome and
>> > manual.
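To make the "instrument pretty much anything with Event Logging" idea concrete: clients report events over plain HTTP, wrapped in a capsule identifying the schema they claim to follow. A minimal client-side sketch, assuming a hypothetical endpoint URL, schema name, and capsule layout (none of these are the real production configuration):

```python
import json
import urllib.request

def make_capsule(schema, revision, event):
    """Wrap an event payload in an EventLogging-style capsule."""
    return {"schema": schema, "revision": revision, "event": event}

def post_event(endpoint, capsule):
    """POST the JSON-encoded capsule; returns the HTTP status code."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(capsule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

capsule = make_capsule("ExampleSearchEvents", 1,
                       {"query": "foo", "results": 10})
# Hypothetical endpoint; a real deployment would point at the
# eventlogging collector for the cluster in question:
# post_event("http://eventlogging.example.org/event", capsule)
```

The point is only that any client able to make an HTTP request - including one running on Labs - can emit events in this shape; the scaling questions Andrew raises are all on the processing side.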
>> >
>> > Eventually I'd like to have a system that is bound by hardware and not
>> > architecture, but that is not well defined and still a long way off.
>> > We will see.
>> >
>> > But, Dan is right, eventlogging might be a good way to get labs data
>> > into the production Analytics Cluster, since any client can log via
>> > HTTP POSTs. We aren't currently importing eventlogging data into the
>> > Analytics Cluster, but one of the points of the almost-finished
>> > eventlogging-kafka work is to get this data into Hadoop, so that
>> > should happen soon.
>> >
>> > The commitment has to be made on both sides. The teams building the
>> > services have to instrument them,
>> >
>> > Agree. If you want HTTP requests to your services and those HTTP
>> > requests go through varnish, this will be very easy. If you want
>> > anything beyond that, the service developers will have to implement it.
>> >
>> > On Jun 10, 2015, at 08:35, Dan Andreescu <[email protected]>
>> > wrote:
>> >
>> > On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes <[email protected]>
>> > wrote:
>> >>
>> >> On 10 June 2015 at 10:53, Dan Andreescu <[email protected]>
>> >> wrote:
>> >> > I see three ways for data to get into the cluster:
>> >> >
>> >> > 1. request stream, handled already; we're working on ways to pump
>> >> > the data back out through APIs
>> >>
>> >> Awesome, and it'd end up in the Hadoop cluster in a table? How... do
>> >> we kick that off most easily?
>> >
>> > No no, I mean our specific web request stream. I don't think there's
>> > any way to piggyback onto that for arbitrary other services. This is
>> > not an option for you; it's just a way that data gets into the
>> > cluster, for completeness.
>> >
>> >> >> Second: what's best practices for this? What resources are
>> >> >> available?
>> >> >> If I'm starting a service on Labs that provides data to
>> >> >> third-parties,
>> >> >
>> >> > What exactly do you mean here? That's a loaded term, and possibly
>> >> > against the labs privacy policy depending on what you mean.
>> >>
>> >> An API, Dan ;)
>> >
>> > Ok, so... usage of the API is what you're after. I think piwik is
>> > probably the best solution.
>> >
>> >> >> what would analytics recommend as my easiest path to getting
>> >> >> request logs into Hadoop?
>> >> >
>> >> > Weighing everything on balance, right now I'd say adding your name
>> >> > to the piwik supporters. So far, off the top of my head, that list
>> >> > is:
>> >> >
>> >> > * wikimedia store
>> >> > * annual report
>> >> > * the entire reading vertical
>> >> > * russian wikimedia chapter (most likely all other chapters would
>> >> > chime in supporting it)
>> >> > * a bunch of labs projects (including wikimetrics, vital signs,
>> >> > various dashboards, etc.)
>> >>
>> >> How is piwik linked to Hadoop? I'm not asking "how do we visualise
>> >> the data"; I'm asking how we get it into the cluster in the first
>> >> place.
>> >
>> > I think for the most part, piwik would handle reporting and crunching
>> > numbers for you and get you some basic reports. But if we wanted to
>> > crunch tons of data, we could integrate it with hadoop somehow.
>> >
>> > I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS, it
>> > did not happen).
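For context on why piwik keeps coming up as the low-effort option: it exposes an HTTP tracking endpoint (piwik.php), so a Labs service can report API hits server-side with one request per hit. A minimal sketch of building such a tracking request, assuming a hypothetical piwik host and site id:

```python
from urllib.parse import urlencode

def piwik_tracking_url(piwik_base, idsite, page_url, action_name):
    """Build a piwik.php tracking request URL for one hit."""
    params = {
        "idsite": idsite,        # numeric site id configured in piwik
        "rec": 1,                # required flag: actually record the hit
        "url": page_url,         # the URL being tracked (e.g. an API call)
        "action_name": action_name,
    }
    return piwik_base.rstrip("/") + "/piwik.php?" + urlencode(params)

# Hypothetical piwik instance and service URL:
tracking = piwik_tracking_url(
    "http://piwik.example.org", 1,
    "http://myservice.example.org/api/v1/search?q=foo",
    "API / search")
```

The service would then issue an HTTP GET to that URL (fire-and-forget); piwik handles the counting and basic reports, which is exactly the "reporting and crunching numbers for you" role described above.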
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
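An aside on the webrequest path discussed in the thread: once a service's varnish traffic is being imported, pulling out one service's records is essentially a filter on hostname. A sketch over JSON-lines records, where the field names (uri_host, uri_path, http_status) are assumptions about the record layout rather than the actual webrequest schema:

```python
import json

def records_for_host(lines, host):
    """Yield parsed webrequest-style JSON records whose uri_host matches."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("uri_host") == host:
            yield rec

# Two fake log lines standing in for imported webrequest records.
sample = [
    json.dumps({"uri_host": "restbase.example.org",
                "uri_path": "/v1/page", "http_status": "200"}),
    json.dumps({"uri_host": "other.example.org",
                "uri_path": "/", "http_status": "404"}),
]
matches = list(records_for_host(sample, "restbase.example.org"))
```

At cluster scale the same filter would run as a query over the imported tables rather than a Python loop, but the shape of the work is the same: the hard part is getting the requests imported, not analyzing them afterwards.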
