Re: Documenting document limits for Lucene and Solr

2012-05-31 Thread Walter Underwood
Deleted documents still consume doc IDs until their segments are merged away, so you 
may run out of doc IDs with fewer than 2^31 searchable documents.

I recommend designing with a lot of slack, maybe planning to use only 75% of the ID 
space. Solr could alert when 90% of the space is used.
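
To make that concrete, here is a rough monitoring sketch (my own illustration, not 
something Solr does today): it opens a single core's index directory with Lucene and 
compares maxDoc(), which still counts deleted-but-unmerged documents, against the int 
ceiling; the 75%/90% numbers are just the thresholds above.

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.FSDirectory;

    public class DocIdHeadroom {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
                long limit = Integer.MAX_VALUE;   // hard ceiling on doc IDs per index
                int maxDoc = reader.maxDoc();     // live docs plus deleted-but-not-yet-merged docs
                int numDocs = reader.numDocs();   // searchable docs only
                double used = (double) maxDoc / limit;
                System.out.printf("maxDoc=%d numDocs=%d deleted=%d used=%.1f%%%n",
                        maxDoc, numDocs, maxDoc - numDocs, used * 100);
                if (used > 0.90) {
                    System.err.println("ALERT: over 90% of the doc ID space is used");
                } else if (used > 0.75) {
                    System.err.println("WARNING: past the 75% design target");
                }
            }
        }
    }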

If you delete everything and then re-add everything without a commit, you will use 
2X the doc IDs. That isn't even the worst case.

If you throttle merging, or suspend it entirely, you can end up with serious doc ID 
consumption.

With no merges, if you find lots of near-duplicates and routinely replace documents 
with a better version, you can have many deleted documents for each searchable 
one. This can happen with web spidering: if you find five mirrors of a 
million-document site and find the best one last, you can use five million doc 
IDs for those million docs.

wunder







Re: Documenting document limits for Lucene and Solr

2012-05-31 Thread Jack Krupansky
Thanks. That’s all good info that should be documented so users are aware of it when 
they start pushing the limits.

-- Jack Krupansky







Documenting document limits for Lucene and Solr

2012-05-30 Thread Jack Krupansky
AFAICT, there is no clear documentation of the maximum number of documents that 
can be stored in a Lucene or Solr index (single core/shard). It appears to be 
2^31, since Lucene document numbers and the value returned by IW.maxDoc are Java 
“int” values. Lucene users have that “hint” to guide them, but that hint is never 
surfaced for Solr users, AFAICT. A few years ago nobody in their right mind 
would have imagined indexing 2 billion documents on a single machine/core, but now 
people are at least tempted to try. So it is now more important for people to 
know about it up front, not hidden down in the fine print of the Lucene file 
format documentation.
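
For reference, a trivial sketch of the arithmetic behind that figure (IW.maxDoc does 
return an int; the class below is just my illustration):

    // Java ints are 32-bit signed, so the largest possible document number is 2^31 - 1.
    public class DocLimitArithmetic {
        public static void main(String[] args) {
            System.out.println(Integer.MAX_VALUE);   // 2147483647, i.e. 2^31 - 1, about 2.1 billion
            System.out.println((1L << 31) - 1);      // the same ceiling, computed as 2^31 - 1
        }
    }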

I wanted to file a Jira on this, but I wanted to check first whether anybody knows 
of an existing Jira for it that was perhaps worded in a way that escaped my 
semi-diligent searches.

I was also thinking of filing it as two Jiras, one for Lucene and one for Solr, 
since the documentation would live in different places. Or should there be one 
combined “Lucene/Solr Capacity Limits/Planning” wiki page? Unless somebody objects, 
I’ll file them as two separate (but linked) issues.

And I was also thinking of filing two more Jiras, for Lucene and Solr, to each add a 
robust check for exceeding the underlying Lucene limit and to report the condition 
with a well-defined exception, rather than letting “numFound” or “maxDoc” go 
negative. But this is separate from the documentation issue, I think. Unless 
somebody objects, I’ll file these as two separate issues.
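
To illustrate the kind of check I mean, here is a hypothetical sketch (not existing 
Lucene or Solr code; the class name and message are made up, and a real fix would 
live inside IndexWriter and Solr themselves):

    // Hypothetical guard, not existing Lucene/Solr code: refuse the add up front
    // instead of letting "numFound"/"maxDoc" wrap around to a negative int.
    import org.apache.lucene.index.IndexWriter;

    final class DocIdLimitGuard {
        // The ceiling implied by int doc numbers; a real check might reserve some slack.
        private static final long MAX_DOC_IDS = Integer.MAX_VALUE;

        static void ensureRoom(IndexWriter writer, int docsToAdd) {
            long projected = (long) writer.maxDoc() + docsToAdd;  // compute in long to avoid int overflow
            if (projected > MAX_DOC_IDS) {
                throw new IllegalStateException("Adding " + docsToAdd
                    + " document(s) would exceed the per-index limit of " + MAX_DOC_IDS
                    + " doc IDs (current maxDoc=" + writer.maxDoc() + ")");
            }
        }
    }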

Any objection to me filing these four issues?

-- Jack Krupansky