I'm all confused. 100M x 13 shards = 1.3G records, not 1.25T. But I get it:
1.5 x 10^7 x 12 x 7 = 1.26 x 10^9 = 1.26 billion, or am I off base again?
But yes, at 100M records that would be 13 servers.
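If it helps, here's that same arithmetic as a throwaway sanity check (plain
Java; the 15M/month figure and the 100M-docs-per-shard ceiling are just the
numbers quoted in this thread, not measured limits):

    public class ShardEstimate {
        public static void main(String[] args) {
            long docsPerMonth = 15000000L;          // 1.5 x 10^7 audit records/month
            long totalDocs = docsPerMonth * 12 * 7; // seven years = 1.26 x 10^9 docs
            long docsPerShard = 100000000L;         // assumed 100M-doc shard ceiling
            long shards = (totalDocs + docsPerShard - 1) / docsPerShard; // ceiling division
            System.out.println(totalDocs + " docs -> " + shards + " shards"); // 13 shards
        }
    }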
As for whether 100M documents/shard is reasonable... it depends (tm). There
are so many variables that the *only* way to find out is to try it with
*your* data and *your* queries; otherwise it's just guessing. Are you
faceting? Sorting? Do you have 10 unique terms/field? 10M unique terms? 10B
unique terms? All that stuff goes into the mix to determine how many
documents a shard can hold and still give adequate performance. Not to
mention the question "what's the hardware?" A MacBook Air with 4G of memory?
A monster piece of metal with a bazillion gigs of memory and SSDs?

All that said, and especially with trunk, 100M documents/shard is quite
possible. So is 10M docs/shard. And it's not really the size of the
documents alone that determines the requirements; it's this weird
calculation of how many docs you have, how many unique terms/doc, and how
you're searching them. I expect your documents are quite small, so that may
help. Some.

Try filling out the spreadsheet here:
http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
and you'll swiftly find out how hard abstract estimations are....
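If you want real inputs for those unique-term questions, you can count them
from a representative index instead of guessing. A minimal sketch against
the Lucene 3.x API that was current for this thread (the index path is a
placeholder):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class UniqueTermCounts {
        public static void main(String[] args) throws Exception {
            // Open a sample index; "/path/to/index" is hypothetical.
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
            Map<String, Long> counts = new HashMap<String, Long>();
            TermEnum terms = reader.terms(); // walks every unique term, grouped by field
            while (terms.next()) {
                String field = terms.term().field();
                Long n = counts.get(field);
                counts.put(field, n == null ? 1L : n + 1L);
            }
            terms.close();
            reader.close();
            System.out.println(counts); // unique-term count per field, for the spreadsheet
        }
    }

Feed those per-field numbers into the spreadsheet and the estimate gets a
lot less abstract.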
Best
Erick

On Tue, Feb 7, 2012 at 9:07 PM, Peter Miller
<peter.mil...@objectconsulting.com.au> wrote:
> Oops again! Turns out I got to the right result earlier by the wrong means! I
> found this reference (http://www.dejavutechnologies.com/faq-solr-lucene.html)
> that states shards can be up to 100,000,000 documents. So, I'm back to 13
> shards again. Phew!
>
> Now I'm just wondering if Cassandra/Lucandra would be a better option
> anyway. If Cassandra offers some of the same advantages as the OpenStack
> Swift object store does, then it should be the way to go.
>
> Still looking for thoughts...
>
> Thanks,
> The Captn
>
> -----Original Message-----
> From: Peter Miller [mailto:peter.mil...@objectconsulting.com.au]
> Sent: Wednesday, 8 February 2012 12:20 PM
> To: java-user@lucene.apache.org
> Subject: RE: How best to handle a reasonable amount of data (25TB+)
>
> Whoops! Very poor basic maths, I should have written it down. I was thinking
> 13 shards. But yes, 13,000 is a bit different. Now I'm in even more need of
> help.
>
> How's this for "easy": 15 million audit records a month, coming from several
> active systems, and a requirement to keep and search across seven years of
> data.
>
> <Goes off to do more googling>
>
> Thanks a lot,
> The Captn
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, 8 February 2012 12:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount of data (25TB+)
>
> I'm curious what the nature of your data is such that you have 1.25 trillion
> documents. Even at 100M/shard, you're still talking 12,500 shards. The
> "laggard" problem will rear its ugly head, not to mention that the
> administration of that many machines will be, shall we say, non-trivial...
>
> Best
> Erick
>
> On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller
> <peter.mil...@objectconsulting.com.au> wrote:
>> Thanks for the response. Actually, I am more concerned with trying to use
>> an Object Store for the indexes. The next concern is the use of a local
>> index versus the sharded ones, but I'm more relaxed about that now after
>> thinking about it. I see that index shards could be up to 100 million
>> documents, so that makes the 1.25 trillion number look reasonable.
>>
>> Any other thoughts?
>>
>> Thanks,
>> The Captn.
>>
>> -----Original Message-----
>> From: ppp c [mailto:peter.c.e...@gmail.com]
>> Sent: Monday, 6 February 2012 5:29 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: How best to handle a reasonable amount of data (25TB+)
>>
>> It sounds like an issue not with Lucene itself but with the logic of your
>> app. If you're worried about having too many docs in one index, you can
>> make multiple indexes, then search across all of them and merge the
>> results.
>>
>> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller
>> <peter.mil...@objectconsulting.com.au> wrote:
>>
>>> Hi,
>>>
>>> I have a little bit of an unusual set of requirements, and I am
>>> looking for advice. I have researched the archives and seen some
>>> relevant posts, but they are fairly old and not specifically a match,
>>> so I thought I would give this a try.
>>>
>>> We will eventually have about 50TB of raw, non-searchable data and 25TB
>>> of search attributes to handle in Lucene, across about 1.25 trillion
>>> documents. The app is write once, read many. There are many document
>>> types involved that have to be searchable separately or together, with
>>> some common attributes but also unique ones per type. I plan on using a
>>> JCR implementation that uses Lucene under the covers. The data itself is
>>> not searchable, only the attributes. I plan to hook the JCR repo
>>> (ModeShape) up to OpenStack Object Storage on commodity hardware,
>>> eventually with 5 machines, each with 24 x 2TB drives. This should
>>> allow for redundancy (3 copies), although I suppose we would add bigger
>>> drives as we go on.
>>>
>>> Since there is such a lot of data to index (not outrageous amounts
>>> these days, but a bit chunky), I was assuming that the Lucene indexes
>>> would go on the object storage solution too, to handle availability and
>>> other infrastructure issues. Most of the searches would be
>>> date-constrained, so I thought that the indexes could be sharded by
>>> date.
>>>
>>> There would be a local disk index being built in near real time on the
>>> JCR hardware that could be regularly merged into the main indexes on
>>> the object storage, I suppose.
>>>
>>> Does that make sense, and would it work? Sorry, but this is just
>>> theoretical at the moment and I'm not experienced with Lucene, as you
>>> can no doubt tell.
>>>
>>> I came across a piece about Hadoop and distributed Solr,
>>> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/,
>>> and I'm now wondering if that would be a superior approach? Or any other
>>> suggestions?
>>>
>>> Many Thanks,
>>> The Captn
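As a footnote to the "make multiple indexes and search across them" idea
quoted above: in Lucene that maps directly onto MultiReader, and it combines
naturally with the date-based sharding proposed in the original post, since a
date-constrained query only has to open the shards in its range. A minimal
sketch, again against the 3.x API of the day (the shard paths, field name,
and query string are all made up):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class DateShardedSearch {
        public static void main(String[] args) throws Exception {
            // One index per month; only shards inside the query's date range
            // need to be opened. Paths are hypothetical.
            String[] shardPaths = { "/indexes/2012-01", "/indexes/2012-02" };
            IndexReader[] readers = new IndexReader[shardPaths.length];
            for (int i = 0; i < shardPaths.length; i++) {
                readers[i] = IndexReader.open(FSDirectory.open(new File(shardPaths[i])));
            }
            // MultiReader presents the shards as one logical index, so hit
            // merging and scoring across them is handled for you.
            MultiReader multi = new MultiReader(readers);
            IndexSearcher searcher = new IndexSearcher(multi);
            Query q = new QueryParser(Version.LUCENE_35, "body",
                    new StandardAnalyzer(Version.LUCENE_35)).parse("audit AND failure");
            TopDocs hits = searcher.search(q, 10);
            System.out.println(hits.totalHits + " hits across " + readers.length + " shards");
            searcher.close();
            multi.close(); // by default this also closes the sub-readers
        }
    }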
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org