Hi Andy,

We are currently indexing about 650,000 full-text books per Solr/Lucene index. We have 10 shards, for a total of about 6.5 million documents. Our average response time is under 2 seconds, but the slowest 1% of queries take between 5 and 30 seconds. If you were searching only one index of 650,000 documents instead of the 6.5 million, response time would be quite a bit better. If you only allow Boolean "AND" queries and use stopwords, response time would be significantly better. Our slowest searches are almost all phrase queries containing common words.
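For concreteness, here is a rough SolrJ sketch of the two query shapes (a phrase query vs. a plain Boolean AND) fanned out over shards. The host names, core name, "ocr" field, and shard list are just placeholders, and the class names are from newer SolrJ releases, so adjust to whatever version you are running:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PhraseVsBooleanQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for the core that coordinates the distributed search.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://solr-head:8983/solr/books").build();
        String shards = "shard1:8983/solr/books,shard2:8983/solr/books"; // ... and so on for the rest

        // Phrase query with common words: positions for every occurrence of
        // every term must be read to check adjacency, so this is the slow case.
        SolrQuery phrase = new SolrQuery("ocr:\"the history of the world\"");
        phrase.set("shards", shards);

        // Boolean AND over the same terms: no position data needed, much cheaper.
        SolrQuery booleanAnd = new SolrQuery("ocr:(history AND world)");
        booleanAnd.set("shards", shards);

        for (SolrQuery q : new SolrQuery[] {phrase, booleanAnd}) {
            QueryResponse rsp = solr.query(q);
            System.out.println(q.getQuery() + " -> QTime=" + rsp.getQTime()
                + " ms, hits=" + rsp.getResults().getNumFound());
        }
        solr.close();
    }
}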
You probably need to define what you mean by "searched quickly" and what kind of load you are expecting. You also need to think about what kind of hardware you want to use.

As index sizes get large, disk I/O can become a bottleneck. Using more memory for the OS disk cache and for the Solr/Lucene caches can compensate for this, and using SSDs instead of hard disks can also help, as Toke can tell you. On the other hand, frequent index updates can invalidate both the OS I/O cache and the Solr/Lucene caches, so there are lots of trade-offs to tune.

Lucene had a limit of about 2.4 billion unique terms per segment, which we ran into because we have dirty OCR and 200 languages (http://www.hathitrust.org/blogs/large-scale-search/too-many-words). However, Michael McCandless has since raised the limit to about 274 billion unique terms. Chances are you will run into disk I/O or other bottlenecks long before you reach this limit.

BTW: we index whole books as Solr documents, not chapters or pages.

Tom
www.hathitrust.org/blogs
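P.S. If you are curious how close an index is to the per-segment unique-term limit, something along these lines (written against a recent Lucene API; the index path and field name are placeholders) will print the unique term count for each segment:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class TermCountPerSegment {
    public static void main(String[] args) throws Exception {
        // Placeholder index path and field name.
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            for (LeafReaderContext ctx : reader.leaves()) {
                Terms terms = ctx.reader().terms("ocr");
                if (terms != null) {
                    // size() returns -1 if the codec can't report the count cheaply.
                    System.out.println("segment " + ctx.ord + ": "
                        + terms.size() + " unique terms");
                }
            }
        }
    }
}

With dirty OCR and 200 languages, that per-segment count grows much faster than the document count alone would suggest.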