Hi,
The segment size and this buffer parameter are unrelated to each other.
Lucene builds smaller segments during indexing, but they are merged at a
later stage anyway, so producing larger segments from the beginning and
hitting limits like the one you see is not required for fast search.
Raising that limit therefore does not make sense. The 32-bit addressing
is a Java limitation, as array sizes are limited to 2^31 minus a few
elements. The reason for this is that IndexWriter needs to build its
initial structures on the Java heap and is therefore bound by 32-bit
array addressing.
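To illustrate the limit (this is not Lucene code, just a minimal sketch): the ArithmeticException in the trace below comes from Math.addExact, which throws as soon as a byte offset would exceed Integer.MAX_VALUE (2^31 - 1) instead of silently wrapping around. The offset and buffer-size values here are made up for the demonstration:

```java
// Minimal illustration of the 32-bit addressing limit via Math.addExact.
public class OverflowDemo {
    public static void main(String[] args) {
        int byteOffset = Integer.MAX_VALUE - 10; // hypothetical offset near the int limit
        try {
            // adding one more buffer's worth of bytes exceeds the int range
            int next = Math.addExact(byteOffset, 32 * 1024);
            System.out.println("next offset: " + next);
        } catch (ArithmeticException e) {
            // prints "hit the limit: integer overflow"
            System.out.println("hit the limit: " + e.getMessage());
        }
    }
}
```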
Normally an index consists of several segments with an approximately
logarithmic size distribution. This allows updates to be applied faster
and makes the more frequent merges of smaller segments cheaper. If your
index is read-only, consider calling IndexWriter#forceMerge(1) after
building it. But be aware: once you have done that, updating the index
may perform very badly, as the index will accumulate many deleted
documents that can't be merged away. So a force merge only makes sense
when the index is read-only afterwards.
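As a minimal sketch of the build-then-force-merge pattern (the directory path and analyzer choice are illustrative, assuming a recent Lucene version):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BuildReadOnlyIndex {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // ... add all documents here ...
            writer.forceMerge(1); // merge everything down to a single segment
            writer.commit();      // only do this if the index stays read-only afterwards
        }
    }
}
```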
Uwe
On 23.05.2025 at 14:59, Viliam Ďurina wrote:
My index contained only vectors, plus a small string ID. That is probably
why it didn't hit any issue. When I added a larger text field to the
document, I hit this exception:
Exception in thread "Thread-0" java.lang.RuntimeException: java.lang.ArithmeticException: integer overflow
    at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:126)
    at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.ArithmeticException: integer overflow
    at java.base/java.lang.Math.addExact(Math.java:912)
    at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
    at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
    at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:170)
    at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
    at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1300)
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1553)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1838)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1478)
    at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:123)
    ... 1 more
I suppose there are places that use 32-bit addressing, such as the
`ByteBlockPool.byteOffset` above, and perhaps others, but Lucene includes
the entire RAM usage in the limit, and therefore builds unnecessarily
small segments.
Viliam
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org