Hi Andy,

We are currently indexing about 650,000 full-text books per Solr/Lucene index. We have 10 shards, for a total of about 6.5 million documents. Our average response time is under 2 seconds, but the slowest 1% of queries take between 5 and 30 seconds. If you were searching only one index of 650,000 documents instead of the 6.5 million, response time would be quite a bit better. If you only allow Boolean "AND" queries and use stopwords, response time would be significantly better. Our slowest searches are almost all phrase queries containing common words.
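For concreteness, here is a rough SolrJ sketch of the two query shapes (a phrase query vs. a plain Boolean AND) fanned out over shards. The host names, core name, "ocr" field, and shard list are just placeholders, and the class names are from newer SolrJ releases, so adjust to whatever version you are running:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PhraseVsBooleanQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for the core that coordinates the distributed search.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://solr-head:8983/solr/books").build();
        String shards = "shard1:8983/solr/books,shard2:8983/solr/books"; // ... and so on for the rest

        // Phrase query with common words: positions for every occurrence of
        // every term must be read to check adjacency, so this is the slow case.
        SolrQuery phrase = new SolrQuery("ocr:\"the history of the world\"");
        phrase.set("shards", shards);

        // Boolean AND over the same terms: no position data needed, much cheaper.
        SolrQuery booleanAnd = new SolrQuery("ocr:(history AND world)");
        booleanAnd.set("shards", shards);

        for (SolrQuery q : new SolrQuery[] {phrase, booleanAnd}) {
            QueryResponse rsp = solr.query(q);
            System.out.println(q.getQuery() + " -> QTime=" + rsp.getQTime()
                + " ms, hits=" + rsp.getResults().getNumFound());
        }
        solr.close();
    }
}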
You probably need to define what you mean by "searched quickly" and what kind of load you are expecting. You also need to think about what kind of hardware you want to use.

As index sizes get large, disk I/O can become a bottleneck. Using more memory for the OS disk cache and for the Solr/Lucene caches can compensate for this, and using SSDs instead of hard disks can also help, as Toke can tell you. On the other hand, frequent index updates can invalidate both the OS I/O cache and the Solr/Lucene caches, so there are lots of trade-offs to tune.

Lucene had a limit of about 2.4 billion unique terms per segment, which we ran into because we have dirty OCR and 200 languages (http://www.hathitrust.org/blogs/large-scale-search/too-many-words). However, Michael McCandless has since raised the limit to about 274 billion unique terms. Chances are you will run into disk I/O or other bottlenecks long before you reach this limit.

BTW: we index whole books as Solr documents, not chapters or pages.

Tom
www.hathitrust.org/blogs
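P.S. If you are curious how close an index is to the per-segment unique-term limit, something along these lines (written against a recent Lucene API; the index path and field name are placeholders) will print the unique term count for each segment:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class TermCountPerSegment {
    public static void main(String[] args) throws Exception {
        // Placeholder index path and field name.
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            for (LeafReaderContext ctx : reader.leaves()) {
                Terms terms = ctx.reader().terms("ocr");
                if (terms != null) {
                    // size() returns -1 if the codec can't report the count cheaply.
                    System.out.println("segment " + ctx.ord + ": "
                        + terms.size() + " unique terms");
                }
            }
        }
    }
}

With dirty OCR and 200 languages, that per-segment count grows much faster than the document count alone would suggest.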