Deleted documents use IDs, so you may run out of doc IDs with fewer than 2^31 searchable documents.
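
As a rough sketch of that arithmetic (the function names are hypothetical and are not Lucene or Solr APIs; the limit is assumed to be the largest signed 32-bit Java int, and the real usable limit may be slightly lower), something like this estimates how much of the doc ID space an index consumes once unmerged deleted documents are counted:

```python
# Illustrative doc ID budget arithmetic. Nothing here calls Lucene or Solr;
# the names and the exact limit are assumptions made for this sketch.
MAX_DOC_IDS = 2**31 - 1  # largest value of a signed 32-bit Java int


def doc_id_usage(live_docs: int, deleted_docs: int) -> float:
    """Fraction of the doc ID space used by live plus unmerged deleted docs."""
    return (live_docs + deleted_docs) / MAX_DOC_IDS


def live_doc_budget(target_usage: float, deleted_per_live: float) -> int:
    """Largest live doc count that keeps total usage at or below target_usage,
    assuming deleted_per_live unmerged deleted docs for each live doc."""
    return int(MAX_DOC_IDS * target_usage / (1.0 + deleted_per_live))
```

For example, `live_doc_budget(0.75, 1.0)` gives the number of searchable documents that fit if you want to stay under 75% of the ID space while carrying one unmerged deleted document per live one.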
I recommend designing with a lot of slack, maybe using only 75% of IDs. Solr might alert when 90% of the space is used.

If you delete everything and then re-add everything without a commit, you will use 2X the doc IDs. That isn't even the worst case. If you reduce or black out merging, you can end up with serious doc ID consumption.

With no merges, if you find lots of near-dupes and routinely replace documents with a better version, you can have many deleted documents for each searchable one. This can happen with web spidering. If you find five mirrors of a million-document site, and find the best one last, you can use five million doc IDs for those million docs.

wunder

On May 30, 2012, at 8:52 AM, Jack Krupansky wrote:

> AFAICT, there is no clear documentation of the maximum number of documents
> that can be stored in a Lucene or Solr Index (single core/shard). It appears
> to be 2^31 since a Lucene document number and the value returned from
> IW.maxDoc is a Java “int”. Lucene users have that “hint” to guide them, but
> that hint is never surfaced for Solr users, AFAICT. A few years ago nobody in
> their right mind would imagine indexing 2 billion documents in a single
> machine/core, but now people are at least tempted to try. So, it is now more
> important for people to know about it, up front, not hidden down in the fine
> print of Lucene file formats.
>
> I wanted to file a Jira on this, but I wanted to check first if anybody knows
> of an existing Jira for it that maybe was worded in a way that it escaped my
> semi-diligent searches.
>
> I was also thinking of filing it as two Jiras, one for Lucene and one for
> Solr since the doc would be in different places. Or, should there be one
> combined “Lucene/Solr Capacity Limits/Planning” wiki? Unless somebody
> objects, I’ll file as two separate (but linked) issues.
>
> And, I was also thinking of filing two Jiras for Lucene and Solr to each have
> a robust check for exceeding the underlying Lucene limit and reporting this
> exception in a well-defined manner rather than “numFound” or “maxDoc” going
> negative. But this is separate from the documentation issue, I think. Unless
> somebody objects, I’ll file these as two separate issues.
>
> Any objection to me filing these four issues?
>
> -- Jack Krupansky