Hadoop was originally built for indexing the web: processing the web map
and exporting indexes to serving systems. I think an integration with
Elasticsearch would work well.
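
A minimal PySpark sketch of what that integration could look like, using
the elasticsearch-hadoop Spark SQL data source; the host, index name, and
schema below are illustrative assumptions, not anyone's actual setup:

    from pyspark.sql import SparkSession

    # assumes the elasticsearch-hadoop jar is on the Spark classpath,
    # e.g. spark-submit --jars elasticsearch-hadoop-<version>.jar
    spark = SparkSession.builder.appName("completion-index-build").getOrCreate()

    # hypothetical (page_title, weekly_views) rows, e.g. aggregated from Hive
    docs = spark.createDataFrame(
        [("Main_Page", 1200345), ("Hadoop", 5321)],
        ["page_title", "weekly_views"])

    # write the rows as documents into an assumed "completion/page" index
    (docs.write
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost")
        .save("completion/page"))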

-Toby

On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou <
[email protected]> wrote:

> @Erik:
> Reading this thread makes me think that it might be interesting to have a
> chat about using Hadoop for indexing (
> https://github.com/elastic/elasticsearch-hadoop).
> I have no idea how you currently index, but I'd love to learn :)
> Please let me know if you think it could be useful!
> Joseph
>
> On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson <
> [email protected]> wrote:
>
>> Makes sense. We will indeed be doing a batch process once a week to build
>> the completion indices, which ideally will run through all the wikis in a
>> day. We are going to do some analysis into how up to date our page view
>> data really needs to be for scoring purposes, though. If we can get good
>> scoring results while only updating page view info when a page is edited,
>> we might be able to spread the load out over time that way and just hit
>> the page view API once for each edit. Otherwise, I'm sure we can do as
>> suggested earlier and pull the data from Hive directly into a temporary
>> structure we can query while building the completion indices.
>>
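
A rough PySpark sketch of that fallback, pulling a week of per-page view
counts out of Hive into a temporary, queryable structure; the table and
column names (wmf.pageview_hourly, view_count, agent_type) are assumptions
about the Analytics warehouse schema:

    from pyspark.sql import SparkSession

    # Hive-enabled session; assumes Spark is configured against the
    # Analytics Hive metastore
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # aggregate one week of hourly pageview rows per (project, page_title)
    weekly = spark.sql("""
        SELECT project, page_title, SUM(view_count) AS weekly_views
        FROM wmf.pageview_hourly
        WHERE year = 2015 AND month = 9 AND day BETWEEN 9 AND 15
          AND agent_type = 'user'
        GROUP BY project, page_title
    """)

    # the temporary structure to query while building the completion indices
    weekly.createOrReplaceTempView("weekly_pageviews")
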
>> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu <[email protected]>
>> wrote:
>>
>>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <[email protected]>
>>> wrote:
>>>
>>>> On 15 September 2015 at 19:37, Dan Andreescu <[email protected]>
>>>> wrote:
>>>>
>>>>>> I worry a little bit about the performance without having a batch API,
>>>>>> but we can certainly try it out and see what happens. Basically, we
>>>>>> will be requesting the page view information for every NS_MAIN article
>>>>>> in every wiki once a week. A quick sum against our search cluster
>>>>>> suggests this is ~96 million API requests.
>>>>>>
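
For a sense of scale, a single request against the per-article pageviews
REST endpoint looks roughly like this (title and date range are arbitrary):

    import requests

    # per-article endpoint: project/access/agent/article/granularity/start/end
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/user/Main_Page/daily/20150908/20150915")
    print(requests.get(url).json())

Multiplied by every NS_MAIN article, that is the ~96 million requests above.
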
>>>>>
>>>> 96M requests works out to approx. 160 req/s, which is more than
>>>> sustainable for RESTBase.
>>>>
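
A quick sanity check on that figure (96 million requests spread evenly over
one week):

    # back-of-the-envelope: weekly request volume as a steady rate
    requests_per_week = 96_000_000
    seconds_per_week = 7 * 24 * 3600             # 604800
    print(requests_per_week / seconds_per_week)  # ~158.7 req/s
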
>>>
>>> True, if we distributed the load over the whole week, but I think Erik
>>> needs the results to be available weekly, as in, probably within a day or
>>> so of issuing the request. Of course, if we were to serve this kind of
>>> request from the API, we would build a better batch-query endpoint for his
>>> use case, though it might be hard to make that generally useful. For now,
>>> let's just collect these one-off pageview querying use cases and slowly
>>> build them into the API when we can generalize two or more of them into
>>> one endpoint.
>>>
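
A purely hypothetical sketch of the kind of batch endpoint described above;
no such endpoint exists, and the URL and payload shape are invented for
illustration only:

    import requests

    # one POST returning view counts for many titles at once, instead of
    # ~96M single-article GETs (endpoint and fields are hypothetical)
    resp = requests.post(
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article-batch",
        json={
            "project": "en.wikipedia",
            "granularity": "weekly",
            "articles": ["Main_Page", "Hadoop", "Elasticsearch"],
        })
    print(resp.json())
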
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
