Re: Documenting document limits for Lucene and Solr

Walter Underwood Thu, 31 May 2012 10:30:51 -0700

Deleted documents keep their doc IDs until a merge reclaims them, so you may 
run out of doc IDs with fewer than 2^31 searchable documents.

I recommend designing with a lot of slack, maybe using only 75% of the doc 
IDs. Solr could raise an alert when 90% of the ID space is used.
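
A rough sketch of that kind of check, in plain Java. The class and method 
names are made up for illustration and this is not an existing Solr feature. 
It assumes the ceiling is Integer.MAX_VALUE (the real limit may be slightly 
lower) and that the input is maxDoc, which counts deleted documents that have 
not been merged away yet.

```java
public class DocIdHeadroom {
    // Assumption: doc numbers are Java ints, so the ceiling is at most Integer.MAX_VALUE.
    static final long ID_CEILING = Integer.MAX_VALUE;

    enum Level { OK, OVER_DESIGN_BUDGET, ALERT }

    // Fraction of the doc ID space consumed. Pass maxDoc, not the searchable count.
    static double usedFraction(long maxDoc) {
        return (double) maxDoc / ID_CEILING;
    }

    // 75% is the suggested design budget, 90% the suggested alert threshold.
    static Level check(long maxDoc) {
        double used = usedFraction(maxDoc);
        if (used >= 0.90) return Level.ALERT;
        if (used >= 0.75) return Level.OVER_DESIGN_BUDGET;
        return Level.OK;
    }

    public static void main(String[] args) {
        System.out.println(check(1_000_000_000L)); // OK
        System.out.println(check(1_700_000_000L)); // OVER_DESIGN_BUDGET
        System.out.println(check(2_000_000_000L)); // ALERT
    }
}
```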

If you delete everything and then re-add everything without a commit, you 
will use 2X the doc IDs. That isn't even the worst case.

If you reduce or black out merging, you can end up with serious doc ID 
consumption.

With no merges, if you find lots of near-dupes and routinely replace documents 
with a better version, you can have many deleted documents for each searchable 
one. This can happen with web spidering. If you find five mirrors of a 
million-document site, and find the best one last, you can use five million doc 
IDs for those million docs.
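
The arithmetic for both cases, as a small Java sketch. The names are 
hypothetical, and it assumes no merge reclaims the superseded copies while 
the replacements are happening.

```java
public class DocIdConsumption {
    // Doc IDs consumed when each of `docs` documents is indexed `versions` times
    // and none of the superseded copies has been merged away.
    static long idsConsumed(long docs, int versions) {
        return docs * versions;
    }

    // Of those IDs, how many belong to deleted (superseded) copies.
    static long deletedIds(long docs, int versions) {
        return docs * (versions - 1);
    }

    public static void main(String[] args) {
        // Delete everything, then re-add everything: two versions of each doc.
        System.out.println(idsConsumed(1_000_000L, 2)); // 2000000

        // Five mirrors of a million-document site, best one found last.
        System.out.println(idsConsumed(1_000_000L, 5)); // 5000000
        System.out.println(deletedIds(1_000_000L, 5));  // 4000000
    }
}
```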

wunder

On May 30, 2012, at 8:52 AM, Jack Krupansky wrote:

> AFAICT, there is no clear documentation of the maximum number of documents 
> that can be stored in a Lucene or Solr Index (single core/shard). It appears 
> to be 2^31 since a Lucene document number and the value returned from 
> IW.maxDoc is a Java “int”. Lucene users have that “hint” to guide them, but 
> that hint is never surfaced for Solr users, AFAICT. A few years ago nobody in 
> their right mind would imagine indexing 2 billion documents in a single 
> machine/core, but now people are at least tempted to try. So, it is now more 
> important for people to know about it, up front, not hidden down in the fine 
> print of Lucene file formats.
>  
> I wanted to file a Jira on this, but I wanted to check first if anybody knows 
> of an existing Jira for it that maybe was worded in a way that it escaped my 
> semi-diligent searches.
>  
> I was also thinking of filing it as two Jiras, one for Lucene and one for 
> Solr since the doc would be in different places. Or, should there be one 
> combined “Lucene/Solr Capacity Limits/Planning” wiki? Unless somebody 
> objects, I’ll file as two separate (but linked) issues.
>  
> And, I was also thinking of filing two Jiras for Lucene and Solr to each have 
> a robust check for exceeding the underlying Lucene limit and reporting this 
> exception in a well-defined manner rather than “numFound” or “maxDoc” going 
> negative. But this is separate from the documentation issue, I think. Unless 
> somebody objects, I’ll file these as two separate issues.
>  
> Any objection to me filing these four issues?
> 
> -- Jack Krupansky
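
On counts going negative, as mentioned in the quoted message: the sketch 
below shows only how Java int arithmetic behaves at the boundary. It is not 
Lucene or Solr code, and the class and method names are invented for 
illustration. An unchecked add wraps around silently, while Math.addExact 
throws, which is one way a well-defined exception could be raised.

```java
public class DocCountOverflow {
    // Plain int addition wraps around on overflow without any error.
    static int uncheckedAdd(int count, int added) {
        return count + added;
    }

    // Math.addExact throws ArithmeticException instead of wrapping.
    static int checkedAdd(int count, int added) {
        return Math.addExact(count, added);
    }

    public static void main(String[] args) {
        // One past the largest int becomes the smallest int.
        System.out.println(uncheckedAdd(Integer.MAX_VALUE, 1)); // -2147483648

        try {
            checkedAdd(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("doc count would exceed the int range");
        }
    }
}
```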
