As someone who has run production serving systems on top of Hadoop, I think this is risky. We've had substantial planned and unplanned downtime on the cluster (which is to be expected) and it would be bad for a pageview API to be impacted.
-Toby

On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto <[email protected]> wrote:

> I think we could add Impala in storage technologies to assess.
>
> I think we don’t want to build the pageview API on top of the Analytics
> Cluster.
>
>
> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]> wrote:
>
> I think we could add Impala in storage technologies to assess.
> It allows reading / computing straight from HDFS and should be fast enough
> for not too bad UEx.
> Maybe ?
>
>
> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:
>
>> This thread seems to have paused for 1 or 2 days now.
>>
>> So summarizing, the following storage technologies have been mentioned:
>>
>> - PostgreSQL
>> - MySQL
>> - Cassandra
>> - Voldemort
>>
>> And the following concerns have been raised on using something that:
>>
>> - We're already familiar with
>> - Permits meta-analytics
>> - Is queriable for json/tsv with little user setup
>> - Withstands high throughput bulk inserts
>> - Is queriable for slice and dice, even if we need to precompute those
>>
>> It seems that there aren't many candidates and that the discussion
>> focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one
>> of each type, say PostgreSQL and Cassandra?
>>
>> Or, anyone with more thoughts or suggestions?
>>
>>
>> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>
>>> If we are going to completely denormalize the data sets for
>>> anonymization,
>>> and we expect just slice and dice queries to the database,
>>> I think we wouldn't take much advantage of a relational DB,
>>> because it wouldn't need to aggregate values, slice or dice,
>>> all slices and dices would be precomputed, right?
>>>
>>> It seems to me that the nature of this denormalized/anonymized data sets
>>> is more like a key-value store. That's why I suggested Voldemort at first
>>> (which, they say, has a slightly faster read than Cassandra), but I see the
>>> preference for Cassandra for it being a known tool inside WMF.
>>> So, +1 for Cassandra!
>>>
>>> However, if we foresee the need of adding more data sets to the same DB,
>>> or querying them in a different way, key-value store would be a limitation.
>>>
>>>
>>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]> wrote:
>>>
>>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]> wrote:
>>>>
>>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]> wrote:
>>>>>
>>>>>> Eric, I think we should allow arbitrary querying on any dimension for
>>>>>> that first data block. We could pre-aggregate all of those combinations
>>>>>> pretty easily since the dimensions have very low cardinality.
>>>>>
>>>>> Are you thinking about something like
>>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>>> dimensions?
>>>>
>>>> only one more right now, called "agent_type". But this is just the
>>>> first "cube" and we're planning a geo cube with more dimensions and are
>>>> probably going to try and release data split up by access method (mobile,
>>>> desktop, etc.) and other dimensions as people need them. This will be
>>>> tricky as we try to protect privacy but that aside, the number of
>>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>>
>>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>>> querying.
>>>>>>
>>>>>> Thanks Gabriel, if you could point us to an example of these
>>>>>> secondary RESTBase indices, that'd be interesting.
>>>>>
>>>>> The API used to define these tables is described in
>>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>>> and the algorithm used to keep those indexes up to date is described in
>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>>> and largely implemented in
>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>>> .
>>>>
>>>> very cool, thx.
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
