On 22 September 2015 at 05:10, Marko Obrovac <[email protected]> wrote:
> Hello,
>
> Just a small note that I don't think has been voiced thus far. There will
> actually be two APIs: one exposed by the Analytics RESTBase instance,
> which will be accessible only from inside WMF's infrastructure, and
> another, public-facing one exposed by the Services RESTBase instance.
>
> Now, these may be identical (both in layout and functionality) or may
> (slightly) differ. Which way to go? The big pro of them being identical is
> that the client wouldn't need to care which RESTBase instance it is
> actually contacting. That would also ease API maintenance. On the
> downside, it increases the overhead for Analytics to keep their domain
> list in sync.
>
> Having a more specialised API for the Analytics instance, on the other
> hand, would allow us to tailor it to real internal use cases instead of
> focusing on overall API coherence (which we need to do for the
> public-facing API). I'd honestly vote for that option.
>

Can you give an example of internal-facing use cases you don't see a
broader population of consumers being interested in?

> On 16 September 2015 at 16:06, Toby Negrin <[email protected]> wrote:
>>
>> Hadoop was originally built for indexing the web by processing the web map
>> and exporting indexes to serving systems. I think integration with
>> Elasticsearch would work well.
>
>
> Right, both are indexing systems (so to speak), but the former is for
> offline use, while the latter targets online use. Ideally, we should make
> them cooperate to get the best of both worlds.
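>
> For instance, a weekly Hadoop job could compute per-page scores offline
> and bulk-load them into Elasticsearch with the elasticsearch-hadoop
> connector Joseph linked below. A minimal PySpark sketch (the Hive table,
> index and host names are made up for illustration):
>
>     from pyspark.sql import SparkSession
>
>     # Offline part: aggregate last week's page views in Hadoop/Hive.
>     # elasticsearch-hadoop must be on the classpath, e.g. via --jars.
>     spark = (SparkSession.builder
>              .appName("pageview-scores-to-es")
>              .enableHiveSupport()
>              .getOrCreate())
>
>     scores = spark.sql("""
>         SELECT page_title, SUM(view_count) AS weekly_views
>         FROM pageview_hourly          -- hypothetical Hive table
>         WHERE year = 2015 AND month = 9 AND day BETWEEN 14 AND 20
>         GROUP BY page_title
>     """)
>
>     # Online part: bulk-load the scores into Elasticsearch for serving.
>     (scores.write
>         .format("org.elasticsearch.spark.sql")
>         .option("es.nodes", "es-serving.example:9200")  # placeholder host
>         .option("es.mapping.id", "page_title")          # doc id = title
>         .mode("append")
>         .save("completion_scores/page"))                # index/type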
>
> Cheers,
> Marko
>
>>
>>
>> -Toby
>>
>> On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou
>> <[email protected]> wrote:
>>>
>>> @Erik:
>>> Reading this thread makes me think that it might be interesting to have a
>>> chat about using Hadoop for indexing
>>> (https://github.com/elastic/elasticsearch-hadoop).
>>> I have no idea how you currently index, but I'd love to learn :)
>>> Please let me know if you think it could be useful!
>>> Joseph
>>>
>>> On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson
>>> <[email protected]> wrote:
>>>>
>>>> Makes sense. We will indeed be doing a batch process once a week to
>>>> build the completion indices, which ideally will run through all the
>>>> wikis in a day. We are going to do some analysis into how up to date
>>>> our page view data really needs to be for scoring purposes, though: if
>>>> we can get good scoring results while only updating page view info when
>>>> a page is edited, we might be able to spread the load out over time
>>>> that way and just hit the page view API once for each edit. Otherwise,
>>>> I'm sure we can do as suggested earlier: pull the data directly from
>>>> Hive and stuff it into a temporary structure we can query while
>>>> building the completion indices.
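>>>>
>>>> For the per-edit option, the lookup would be something like this (a
>>>> rough Python sketch; it assumes the per-article endpoint keeps the
>>>> proposed project/access/agent/article/granularity/start/end layout):
>>>>
>>>>     import requests
>>>>
>>>>     API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
>>>>
>>>>     def weekly_views(project, title, start, end):
>>>>         # start/end are YYYYMMDD; all-access/all-agents aggregates
>>>>         # desktop, mobile web and app traffic.
>>>>         url = "{}/{}/all-access/all-agents/{}/daily/{}/{}".format(
>>>>             API, project, title, start, end)
>>>>         resp = requests.get(url, timeout=10)
>>>>         resp.raise_for_status()
>>>>         return sum(item["views"] for item in resp.json()["items"])
>>>>
>>>>     # e.g. weekly_views("en.wikipedia", "Apache_Hadoop",
>>>>     #                   "20150907", "20150913")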
>>>>
>>>> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu
>>>> <[email protected]> wrote:
>>>>>
>>>>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> On 15 September 2015 at 19:37, Dan Andreescu
>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I worry a little bit about the performance without having a batch
>>>>>>>> API, but we can certainly try it out and see what happens. Basically
>>>>>>>> we will be requesting the page view information for every NS_MAIN
>>>>>>>> article in every wiki once a week. A quick sum against our search
>>>>>>>> cluster suggests this is ~96 million API requests.
>>>>>>
>>>>>>
>>>>>> 96M requests spread over a week is roughly 96,000,000 / 604,800 s,
>>>>>> i.e. approx 160 req/s, which is more than sustainable for RESTBase.
>>>>>
>>>>>
>>>>> True, if we distributed the load over the whole week, but I think Erik
>>>>> needs the results to be available weekly, as in, probably within a day
>>>>> or so of issuing the request. Of course, if we were to serve this kind
>>>>> of request from the API, we would make a better batch-query endpoint
>>>>> for his use case. But I think it might be hard to make that useful
>>>>> generally. I think for now, let's just collect these one-off
>>>>> pageview-querying use cases and slowly build them into the API when we
>>>>> can generalize two or more of them into one endpoint.
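>>>>>
>>>>> To make that concrete, a batch endpoint could accept a list of titles
>>>>> per request, something like this (purely illustrative -- no such
>>>>> endpoint exists today, and the URL and payload shape are made up):
>>>>>
>>>>>     import requests
>>>>>
>>>>>     def batch_weekly_views(project, titles, start, end):
>>>>>         # One POST for many articles instead of one GET per article.
>>>>>         resp = requests.post(
>>>>>             "https://wikimedia.org/api/rest_v1/metrics/pageviews"
>>>>>             "/per-article-batch",  # hypothetical
>>>>>             json={"project": project, "articles": titles,
>>>>>                   "granularity": "weekly",
>>>>>                   "start": start, "end": end},
>>>>>             timeout=30)
>>>>>         resp.raise_for_status()
>>>>>         return {i["article"]: i["views"] for i in resp.json()["items"]}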
>>>>>
>>>
>>>
>>>
>>> --
>>> Joseph Allemandou
>>> Data Engineer @ Wikimedia Foundation
>>> IRC: joal
>>>
>>
>>
>
>
>
> --
> Marko Obrovac, PhD
> Senior Services Engineer
> Wikimedia Foundation
>



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
