Tens of thousands, eh? I've had ~100-150 running and that worked fine. I could see issues with Blur's table tracking, since it's ZooKeeper-backed, and ZooKeeper doesn't like massive directories like that. Then again, Blur has a caching system built in for its metadata, so maybe it would be OK?
Are the table structures going to be different? Is there any reasonable grouping you could do of the customers? Perhaps the small ones could live together in a larger index?

On Fri, Dec 21, 2012 at 11:08 AM, Aaron McCurry <[email protected]> wrote:

> I agree with Garrett. We run ~100 tables with the shard count varying from
> 1 shard to over 1000 in a single table. How many tables will you have?
>
> Yes, Blur works on CDH3U2. It should work on any 0.20.x (1.0.x) version of
> Hadoop. However, if HDFS doesn't support appends, then the write-ahead log
> won't function correctly, meaning it won't actually preserve the data.
>
> Aaron
>
>
> On Fri, Dec 21, 2012 at 10:59 AM, Garrett Barton <[email protected]> wrote:
>
> > If I understand you correctly, you have data from multiple customers
> > (denoted by a customer_id) and you only perform a search against a single
> > customer at a time? If that's the case, the separate-index route might be a
> > good idea, as you can rebuild them separately, and you can potentially model
> > them differently if you have a need. Having said that, if you also
> > occasionally want to search across customers, then you would want them all
> > in a single index.
> >
> > I have Blur 1.x running on CDH3U5. I think it will work back down to CDH3U2
> > at least, and that's Hadoop 0.20 in both cases. I have not tried 0.23,
> > though I will be needing to soon.
> >
> >
> > On Fri, Dec 21, 2012 at 10:51 AM, James Kebinger <[email protected]> wrote:
> >
> > > Hello, I'm hoping to kick the tires on Apache Blur in the near future. I
> > > have a couple of quick questions before I set out.
> > >
> > > What version(s) of Hadoop are required/supported at present?
> > >
> > > We have lots of data to index, but we always search within a particular
> > > customer's data set. Would the best practice be to put all of the data in
> > > one table and have the customer id in all of the queries, or to build
> > > separate tables for each customer_id (like users-1, users-123, etc.)?
> > >
> > > Thanks, and happy holidays!
> > >
> > > -James Kebinger
