On 10 June 2015 at 12:00, Andrew Otto <[email protected]> wrote:
> Hmmm.
>
> There's no reason we couldn't maintain beta-level Kafka + Hadoop clusters
> in labs. We probably should! I don't really want to maintain them myself,
> but they should be pretty easy to set up using hiera now. I could maintain
> them if no one else wants to.
>
> Thought two:
>
> "so when does new traffic data start hitting the analytics cluster?"
>
> If it is HTTP requests from varnish you are looking for, this will for the
> most part just happen, unless the varnish cluster serving the requests is
> different than the usual webrequest_sources you are used to seeing. I'm
> not sure which varnishes RESTBase HTTP is using, but if they aren't using
> one of the usual ones we are already importing into HDFS, it would be
> trivial to set this up.
>
> If I'm starting a service on Labs that provides data to third parties,
> what would analytics recommend my easiest path is to getting request
> logs into Hadoop?
>
> We can't do this directly into the production Analytics Cluster, since
> labs is firewalled off from production networks. However, a service like
> this would be intended to move to production eventually, yes? If so, then
> perhaps a beta Analytics Cluster would allow you to develop the methods
> needed to get data into Hadoop in Labs. Then the move into production
> would be simpler and already have Analytics Cluster support.
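[Editor's illustration: the beta Kafka cluster idea above implies a Labs service producing its own request logs into a Kafka topic. A minimal sketch, assuming the kafka-python client; the broker address, topic name, and record fields are all invented here, not taken from the thread:]

```python
import json
from datetime import datetime, timezone


def request_log_record(uri, status, user_agent):
    """Build a JSON-serialisable request-log record.

    Field names are illustrative only, not a real webrequest schema.
    """
    return {
        "dt": datetime.now(timezone.utc).isoformat(),
        "uri": uri,
        "status": status,
        "user_agent": user_agent,
    }


def send_to_beta_kafka(record, topic="beta.webrequest.mylabsservice"):
    """Produce one record to a hypothetical beta-cluster Kafka broker."""
    # Deferred import: requires the kafka-python package.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka.beta.wmflabs.org:9092",  # hypothetical host
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(topic, record)
    producer.flush()
```

[Once records land in a topic like this, the usual Kafka-to-HDFS import the thread mentions could pick them up; the point of the sketch is only the shape of the producer side.]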
That sounds better than nothing; not perfect, but totally understandable.
The impression I'm really getting is "stuff should get off Labs ASAP".

> 2. Event Logging. We're making this scale arbitrarily by moving it to
> Kafka. Once that's done, we should be able to instrument pretty much
> anything with Event Logging
>
> Dan, I'd like to not promise anything here at the moment. I think this
> effort will significantly increase our throughput, but I'm not willing to
> claim arbitrary scale. Unless we figure out a way to farm out and
> parallelize eventlogging processors in an easy way, scaling eventlogging
> even with Kafka to big-data sizes will be cumbersome and manual.
>
> Eventually I'd like to have a system that is bound by hardware and not
> architecture, but that is not well defined and still a long way off. We
> will see.
>
> But, Dan is right, eventlogging might be a good way to get labs data into
> the production Analytics Cluster, since any client can log via HTTP POSTs.
> We aren't currently importing eventlogging data into the Analytics
> Cluster, but one of the points of the almost-finished eventlogging-kafka
> is to get this data into Hadoop, so that should happen soon.
>
> The commitment has to be made on both sides. The teams building the
> services have to instrument them,
>
> Agree. If you want HTTP requests to your services and those HTTP requests
> go through varnish, this will be very easy. If you want anything beyond
> that, the service developers will have to implement it.
>
>
> On Jun 10, 2015, at 08:35, Dan Andreescu <[email protected]> wrote:
>
> On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes <[email protected]> wrote:
>>
>> On 10 June 2015 at 10:53, Dan Andreescu <[email protected]> wrote:
>> > I see three ways for data to get into the cluster:
>> >
>> > 1. request stream, handled already, we're working on ways to pump the
>> > data back out through APIs
>>
>> Awesome, and it'd end up in the Hadoop cluster in a table?
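[Editor's illustration: "any client can log via HTTP POSTs" suggests something like the following. The endpoint URL, capsule field names, and schema name are all hypothetical, loosely modelled on EventLogging; nothing in the thread confirms them:]

```python
import json
from urllib import request


def make_capsule(schema, revision, event):
    """Wrap an event dict in an EventLogging-style capsule.

    Field names are illustrative; treat the real capsule format as TBD.
    """
    return {"schema": schema, "revision": revision, "event": event}


def post_event(capsule, url="https://eventlogging.beta.wmflabs.org/event"):
    """POST one JSON capsule to a hypothetical beta EventLogging endpoint."""
    req = request.Request(
        url,
        data=json.dumps(capsule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status
```

[The attraction of this path, as Andrew notes, is that it works from anywhere that can reach the endpoint over HTTP, Labs included, and the Kafka-backed pipeline then takes care of landing the data in Hadoop.]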
>> How...do we kick that off most easily?
>
> Nono, I mean our specific web request stream. I don't think there's any
> way to piggyback onto that for arbitrary other services. This is not an
> option for you, it's just a way that data gets into the cluster, for
> completeness.
>
>> Second: what's best practices for this? What resources are available?
>>
>> If I'm starting a service on Labs that provides data to third parties,
>> >
>> > What exactly do you mean here? That's a loaded term and possibly
>> > against the labs privacy policy depending on what you mean.
>>
>> An API, Dan ;)
>
> Ok, so ... usage of the API is what you're after; I think piwik is
> probably the best solution.
>
>> >> what would analytics recommend my easiest path is to getting request
>> >> logs into Hadoop?
>> >
>> > Weighing everything on balance, right now I'd say adding your name to
>> > the piwik supporters. So far, off the top of my head, that list is:
>> >
>> > * wikimedia store
>> > * annual report
>> > * the entire reading vertical
>> > * russian wikimedia chapter (most likely all other chapters would
>> >   chime in supporting it)
>> > * a bunch of labs projects (including wikimetrics, vital signs,
>> >   various dashboards, etc.)
>>
>> How is piwik linked to Hadoop? I'm not asking "how do we visualise the
>> data", I'm asking how we get it into the cluster in the first place.
>
> I think for the most part, piwik would handle reporting and crunching
> numbers for you and get you some basic reports. But if we wanted to
> crunch tons of data, we could integrate it with hadoop somehow.
>
> I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS, it did
> not happen).
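[Editor's illustration: the piwik option above boils down to firing tracking requests at a Piwik instance's HTTP Tracking API. A sketch of building such a request; the base URL and site id are hypothetical, while `idsite`, `rec`, `url`, and `action_name` are standard tracking parameters:]

```python
from urllib.parse import urlencode


def piwik_tracking_url(base, idsite, page_url, action_name):
    """Build a Piwik HTTP Tracking API request URL.

    `base` is the piwik.php endpoint of a Piwik install (hypothetical here).
    """
    params = {
        "idsite": idsite,       # numeric site id registered in Piwik
        "rec": 1,               # required: marks this as a tracking request
        "url": page_url,        # the URL being tracked
        "action_name": action_name,
    }
    return base + "?" + urlencode(params)
```

[A service would GET this URL (or POST the same parameters) on each request it wants counted; Piwik then does the reporting, which is exactly the "crunching numbers for you" role Dan describes, with any Hadoop integration left as a separate step.]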
--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
