On 22 September 2015 at 05:10, Marko Obrovac <[email protected]> wrote:
> Hello,
>
> Just a small note which I don't think has been voiced thus far. There will
> actually be two APIs: one exposed by the Analytics RESTBase instance,
> which will be accessible only from inside WMF's infrastructure, and
> another, public-facing one exposed by the Services RESTBase instance.
>
> Now, these may be identical (both in layout and functionality) or may
> differ slightly. Which way should we go? The big advantage of keeping them
> identical is that clients wouldn't need to care which RESTBase instance
> they are actually contacting; it would also ease API maintenance. On the
> downside, it increases the overhead for Analytics to keep their domain
> list in sync.
>
> Having a more specialised API for the Analytics instance, on the other
> hand, would allow us to tailor it to real internal use cases instead of
> focusing on overall API coherence (which we need to do for the
> public-facing API). I'd honestly vote for that option.
>
Can you give an example of internal-facing use cases that you don't see a
broader population of consumers being interested in?

> On 16 September 2015 at 16:06, Toby Negrin <[email protected]> wrote:
>>
>> Hadoop was originally built for indexing the web: processing the web map
>> and exporting indexes to serving systems. I think integration with
>> Elasticsearch would work well.
>
> Right, both are indexing systems (so to speak), but the former is for
> offline use, while the latter targets online use. Ideally, we should make
> them cooperate to get the best of both worlds.
>
> Cheers,
> Marko
>
>> -Toby
>>
>> On Wed, Sep 16, 2015 at 7:03 AM, Joseph Allemandou
>> <[email protected]> wrote:
>>>
>>> @Erik:
>>> Reading this thread makes me think that it might be interesting to have
>>> a chat about using Hadoop for indexing
>>> (https://github.com/elastic/elasticsearch-hadoop).
>>> I have no idea how you currently index, but I'd love to learn :)
>>> Please let me know if you think it could be useful!
>>> Joseph
>>>
>>> On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson
>>> <[email protected]> wrote:
>>>>
>>>> Makes sense. We will indeed be doing a batch process once a week to
>>>> build the completion indices, which ideally will run through all the
>>>> wikis in a day. We are going to do some analysis into how up to date
>>>> our page view data really needs to be for scoring purposes, though: if
>>>> we can get good scoring results while only updating page view info
>>>> when a page is edited, we might be able to spread the load out across
>>>> time that way and just hit the page view API once for each edit.
>>>> Otherwise, I'm sure we can do as suggested earlier and pull the data
>>>> from Hive directly and stuff it into a temporary structure we can
>>>> query while building the completion indices.
>>>>
>>>> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu
>>>> <[email protected]> wrote:
>>>>>
>>>>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> On 15 September 2015 at 19:37, Dan Andreescu
>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I worry a little bit about the performance without having a batch
>>>>>>>> API, but we can certainly try it out and see what happens.
>>>>>>>> Basically, we will be requesting the page view information for
>>>>>>>> every NS_MAIN article in every wiki once a week. A quick sum
>>>>>>>> against our search cluster suggests this is ~96 million API
>>>>>>>> requests.
>>>>>>
>>>>>> 96 million requests spread evenly over a week (604,800 seconds) is
>>>>>> approximately 160 req/s, which is more than sustainable for RESTBase.
>>>>>
>>>>> True, if we distributed the load over the whole week, but I think Erik
>>>>> needs the results to be available weekly, as in, probably within a day
>>>>> or so of issuing the request. Of course, if we were to serve this kind
>>>>> of request from the API, we would build a better batch-query endpoint
>>>>> for his use case, but I think it might be hard to make that generally
>>>>> useful. For now, let's just collect these one-off pageview querying
>>>>> use cases and slowly build them into the API when we can generalize
>>>>> two or more of them into one endpoint.
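
For concreteness, a minimal sketch of Joseph's elasticsearch-hadoop idea:
bulk-loading per-page rows from Hadoop into Elasticsearch via the
connector's Spark SQL data source. The Hive table, host name, and index
names below are placeholders rather than the production schema, and the
elasticsearch-hadoop jar is assumed to be on the Spark classpath.

    # Sketch only: bulk-index per-page aggregates from Hadoop into
    # Elasticsearch using the elasticsearch-hadoop Spark SQL connector.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="es-hadoop-index-sketch")
    hive = HiveContext(sc)

    # Hypothetical aggregate table; the real wmf schema may differ.
    docs = hive.sql("""
        SELECT project, page_title, SUM(view_count) AS views
        FROM wmf.pageview_hourly
        GROUP BY project, page_title
    """)

    (docs.write
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "elastic1001.example.org:9200")  # placeholder
         .option("es.resource", "pageviews/doc")              # index/type
         .mode("append")
         .save())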
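
A sketch of the per-edit alternative Erik floats: fetch views for a single
article at edit time rather than sweeping ~96 million titles weekly. The
URL follows the shape of the public per-article pageview REST endpoint;
treat the exact path and its availability as assumptions.

    # Sketch only: look up one article's weekly views on demand.
    import requests

    PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
                 "per-article/{project}/all-access/user/{title}/"
                 "daily/{start}/{end}")

    def weekly_views(project, title, start, end):
        """Sum daily view counts for a single article over one week."""
        url = PAGEVIEWS.format(project=project, title=title,
                               start=start, end=end)
        resp = requests.get(url, headers={"User-Agent": "completion-bot"})
        resp.raise_for_status()
        return sum(item["views"] for item in resp.json()["items"])

    print(weekly_views("en.wikipedia", "Hadoop", "20150914", "20150920"))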
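
And a sketch of the fallback Erik and Dan discuss: pull the week's counts
out of Hive in one batch and stash them in a local store the
completion-index builder can query, sidestepping the API sweep entirely.
Table and column names are again assumptions.

    # Sketch only: snapshot one week of per-page counts from Hive into
    # SQLite for local lookups while building completion indices.
    import sqlite3
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="pageview-snapshot-sketch")
    hive = HiveContext(sc)

    weekly = hive.sql("""
        SELECT project, page_title, SUM(view_count) AS views
        FROM wmf.pageview_hourly              -- assumed source table
        WHERE year = 2015 AND month = 9 AND day BETWEEN 14 AND 20
        GROUP BY project, page_title
    """)

    db = sqlite3.connect("pageviews_week.sqlite")
    db.execute("""CREATE TABLE IF NOT EXISTS pageviews
                  (project TEXT, page_title TEXT, views INTEGER,
                   PRIMARY KEY (project, page_title))""")

    # Stream partitions instead of collect()ing ~96M rows into the driver.
    db.executemany(
        "INSERT OR REPLACE INTO pageviews VALUES (?, ?, ?)",
        ((r.project, r.page_title, r.views)
         for r in weekly.rdd.toLocalIterator()))
    db.commit()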
--
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
