Hadoop was originally built for indexing the web: processing the web map and exporting the resulting indexes to serving systems. I think an integration with Elasticsearch would work well.

-Toby
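For the curious, here is a minimal sketch of what that integration could look like using the elasticsearch-hadoop connector Joseph links below in the thread, writing per-page data from a Spark job into an Elasticsearch index. The index name, document fields, and node address are invented for illustration, and it assumes the es-hadoop jar is on the Spark classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("completion-indexer").getOrCreate()

    # Toy stand-in for the real per-page scoring data.
    docs = spark.createDataFrame(
        [("enwiki", "Main_Page", 123456), ("enwiki", "Hadoop", 6789)],
        ["wiki", "page_title", "weekly_views"],
    )

    (docs.write
        .format("org.elasticsearch.spark.sql")      # es-hadoop Spark SQL sink
        .option("es.nodes", "elastic.example.org")  # hypothetical cluster
        .option("es.resource", "completion/page")   # hypothetical index/type
        .mode("append")
        .save())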
On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou <[email protected]> wrote:

> @Erik: Reading this thread makes me think that it might be interesting
> to have a chat about using Hadoop for indexing
> (https://github.com/elastic/elasticsearch-hadoop).
> I have no idea how you currently index, but I'd love to learn :)
> Please let me know if you think it could be useful!
> Joseph
>
> On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson <[email protected]> wrote:
>
>> Makes sense. We will indeed be doing a batch process once a week to
>> build the completion indices, which ideally will run through all the
>> wikis in a day. We are also going to analyze how up to date our page
>> view data really needs to be for scoring purposes; if we can get good
>> scoring results while only updating page view info when a page is
>> edited, we might be able to spread the load out over time that way and
>> hit the page view API once per edit. Otherwise, I'm sure we can do as
>> suggested earlier and pull the data from Hive directly and stuff it
>> into a temporary structure we can query while building the completion
>> indices.
>>
>> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu <[email protected]> wrote:
>>
>>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <[email protected]> wrote:
>>>
>>>> On 15 September 2015 at 19:37, Dan Andreescu <[email protected]> wrote:
>>>>
>>>>>> I worry a little bit about the performance without having a batch
>>>>>> API, but we can certainly try it out and see what happens.
>>>>>> Basically we will be requesting the page view information for
>>>>>> every NS_MAIN article in every wiki once a week. A quick sum
>>>>>> against our search cluster suggests this is ~96 million API
>>>>>> requests.
>>>>
>>>> 96M requests spread over a week comes to approx 160 req/s
>>>> (96,000,000 / 604,800 seconds), which is more than sustainable for
>>>> RESTBase.
>>>
>>> True, if we distribute the load over the whole week, but I think Erik
>>> needs the results to be available weekly, as in, probably within a
>>> day or so of issuing the request. Of course, if we were to serve this
>>> kind of request from the API, we would build a better batch-query
>>> endpoint for his use case, but I think it might be hard to make that
>>> generally useful. For now, let's just collect these one-off pageview
>>> querying use cases and slowly build them into the API when we can
>>> generalize two or more of them into one endpoint.
>
> --
> Joseph Allemandou
> Data Engineer @ Wikimedia Foundation
> IRC: joal
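To make the numbers in the thread concrete: the per-article route Dan and Marko are sizing up amounts to one GET per page against the pageview REST API. A rough sketch, assuming the shape of the public per-article endpoint; the helper name and throttling value are illustrative only:

    import time
    import requests

    API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

    def weekly_views(project, title, start, end):
        # One request per article, e.g. ("en.wikipedia", "Hadoop",
        # "2015090900", "2015091500"); timestamps are YYYYMMDDHH.
        url = "%s/%s/all-access/all-agents/%s/daily/%s/%s" % (
            API, project, title, start, end)
        items = requests.get(url).json().get("items", [])
        return sum(item["views"] for item in items)

    # ~96 million pages at the ~160 req/s discussed in the thread needs
    # pacing; a real client would parallelize and back off on errors.
    for title in ("Main_Page", "Hadoop"):
        print(title, weekly_views("en.wikipedia", title,
                                  "2015090900", "2015091500"))
        time.sleep(1.0 / 160)  # crude client-side rate limit

And Erik's fallback, pulling the data from Hive directly into a temporary structure, might look roughly like the following; the table and column names (wmf.pageview_hourly, view_count, the year/month/day partitions) are assumptions about the cluster's schema, not a confirmed layout:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("weekly-pageviews")
             .enableHiveSupport()
             .getOrCreate())

    # One week of per-page view counts, aggregated in Hive rather than
    # fetched one page at a time over HTTP.
    weekly = spark.sql("""
        SELECT project, page_title, SUM(view_count) AS views
        FROM wmf.pageview_hourly
        WHERE year = 2015 AND month = 9 AND day BETWEEN 9 AND 15
        GROUP BY project, page_title
    """)

    # The "temporary structure we can query" while building the
    # completion indices; a real job would keep this distributed rather
    # than collecting ~96M rows to the driver.
    views = {(r.project, r.page_title): r.views
             for r in weekly.collect()}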
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
