Hello,

Just a small note that I don't think has been voiced thus far. There will
actually be two APIs: one exposed by the Analytics RESTBase instance,
which will be accessible only from inside the WMF infrastructure, and
another, public-facing one exposed by the Services RESTBase instance.

Now, these may be identical (both in layout and functionality) or may
differ slightly. Which way to go? The big advantage of keeping them
identical is that clients wouldn't need to care which RESTBase instance
they are actually contacting; it would also ease API maintenance. On the
downside, it increases the overhead for Analytics of keeping their domain
list in sync.

Having a more specialised API for the Analytics instance, on the other
hand, would allow us to tailor it to real internal use cases instead of
focusing on overall API coherence (which we need to do for the
public-facing API). I'd honestly vote for that option.

On 16 September 2015 at 16:06, Toby Negrin <[email protected]> wrote:

> Hadoop was originally built for indexing the web by processing the web map
> and exporting indexes to serving systems. I think integration with
> Elasticsearch would work well.
>

Right, both are indexing systems (so to speak), but the former is built
for offline use, while the latter targets online use. Ideally, we should
make them cooperate to get the best of both worlds.
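As a side note, the ~160 req/s figure from the quoted discussion below is
easy to sanity-check; a quick back-of-the-envelope sketch (plain arithmetic,
assuming the ~96 million requests are spread evenly over the week):

```python
# Back-of-the-envelope: ~96 million weekly requests spread evenly over a week.
requests_per_week = 96_000_000
seconds_per_week = 7 * 24 * 60 * 60  # 604,800 seconds

rate = requests_per_week / seconds_per_week
print(f"~{rate:.0f} req/s")  # prints "~159 req/s"
```

So sustained load is just under 160 req/s only if the batch job really is
smeared across the full week; a one-day run multiplies that by seven.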

Cheers,
Marko


>
> -Toby
>
> On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou <
> [email protected]> wrote:
>
>> @Erik:
>> Reading this thread makes me think that it might be interesting to have a
>> chat about using Hadoop for indexing (
>> https://github.com/elastic/elasticsearch-hadoop).
>> I have no idea how you currently index, but I'd love to learn :)
>> Please let me know if you think it could be useful!
>> Joseph
>>
>> On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson <
>> [email protected]> wrote:
>>
>>> Makes sense. We will indeed be doing a batch process once a week to
>>> build the completion indices, which ideally will run through all the wikis
>>> in a day. We are going to do some analysis into how up to date our page
>>> view data really needs to be for scoring purposes, though. If we can get
>>> good scoring results while only updating page view info when a page is
>>> edited, we might be able to spread the load out over time that way and
>>> just hit the page view API once for each edit. Otherwise, I'm sure we can
>>> do as suggested earlier: pull the data from Hive directly and stuff it
>>> into a temporary structure we can query while building the completion
>>> indices.
>>>
>>> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu <[email protected]
>>> > wrote:
>>>
>>>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <[email protected]>
>>>> wrote:
>>>>
>>>>> On 15 September 2015 at 19:37, Dan Andreescu <[email protected]
>>>>> > wrote:
>>>>>
>>>>>>> I worry a little bit about the performance without having a batch
>>>>>>> API, but we can certainly try it out and see what happens. Basically
>>>>>>> we will be requesting the page view information for every NS_MAIN
>>>>>>> article in every wiki once a week. A quick sum against our search
>>>>>>> cluster suggests this is ~96 million API requests.
>>>>>>>
>>>>>>
>>>>> 96M requests per week works out to approximately 160 req/s, which is
>>>>> more than sustainable for RESTBase.
>>>>>
>>>>
>>>> True, if we distributed the load over the whole week, but I think Erik
>>>> needs the results to be available weekly, as in, probably within a day or
>>>> so of issuing the request.  Of course, if we were to serve this kind of
>>>> request from the API, we would make a better batch-query endpoint for his
>>>> use case.  But I think it might be hard to make that useful generally.  I
>>>> think for now, let's just collect these one-off pageview querying use cases
>>>> and slowly build them into the API when we can generalize two or more of
>>>> them into one endpoint.
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>
>>
>> --
>> *Joseph Allemandou*
>> Data Engineer @ Wikimedia Foundation
>> IRC: joal
>>
>


-- 
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation