Hi,

The segment size and this buffer parameter are unrelated to each other. Lucene builds smaller segments during indexing, but they are merged at a later stage anyway, so producing larger segments from the beginning, and hitting limits like the one you see, is not required for fast search. Raising that limit therefore does not make sense. The 32-bit addressing is a Java limitation: array sizes are limited to 2^31 minus ~10 elements. IndexWriter needs to build its initial structures on the Java heap and is therefore bound by 32-bit array addressing.
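To make the limit concrete, here is a minimal JDK-only sketch (no Lucene classes; the 32 KB block size used below is an assumption mirroring Lucene's ByteBlockPool.BYTE_BLOCK_SIZE):

```java
// JDK-only sketch of the 32-bit addressing limit described above.
// The 1 << 15 block size is assumed to mirror ByteBlockPool.BYTE_BLOCK_SIZE.
public class AddressingLimit {
    public static void main(String[] args) {
        // Array indices in Java are int, so no array (and no int byte offset)
        // can address more than 2^31 - 1 elements.
        System.out.println("max addressable: " + Integer.MAX_VALUE);

        // Lucene advances its pool offset with Math.addExact, which throws
        // ArithmeticException instead of silently wrapping around.
        int offset = Integer.MAX_VALUE - 1024; // pool nearly exhausted
        try {
            offset = Math.addExact(offset, 1 << 15); // allocate one more block
        } catch (ArithmeticException e) {
            System.out.println("overflow: " + e.getMessage());
        }
    }
}
```

That guarded addition is exactly where the exception in the quoted stack trace originates.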

Normally an index consists of several segments with an approximately logarithmic size distribution. This allows them to be updated faster and makes the more frequent merges of smaller segments cheaper. If your index is read-only, consider calling IndexWriter#forceMerge(1) after building it. But be aware: once you have done that, updating the index may perform very badly, as the index will accumulate many deleted documents that cannot be merged away. So a force merge only makes sense when the index is read-only afterwards.
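The logarithmic size distribution can be illustrated with a toy model. This is a simplified sketch, not Lucene's actual TieredMergePolicy; the merge factor of 10 is an illustrative assumption:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of log-structured merging: each flush writes one small segment,
// and whenever MERGE_FACTOR segments of the same size accumulate, they are
// merged into a single segment of the next tier.
public class MergeToy {
    static final int MERGE_FACTOR = 10; // illustrative, not Lucene's policy

    public static List<Long> build(int flushes, long flushSize) {
        List<Long> segments = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segments.add(flushSize);
            boolean again = true;
            while (again) {
                again = false;
                for (int j = 0; j < segments.size(); j++) {
                    final long size = segments.get(j);
                    if (Collections.frequency(segments, size) >= MERGE_FACTOR) {
                        segments.removeIf(s -> s == size);   // merge the tier
                        segments.add(size * MERGE_FACTOR);   // into one bigger segment
                        again = true;
                        break;
                    }
                }
            }
        }
        Collections.sort(segments);
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(build(1234, 1));
    }
}
```

With 1234 one-unit flushes the model ends up with ten segments in four tiers (sizes 1, 10, 100, 1000), one tier per decimal digit, so searches touch only a handful of segments while each merge cost stays proportional to the data it rewrites.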

Uwe

On 23.05.2025 at 14:59, Viliam Ďurina wrote:
My index contained only vectors, plus a small string ID. That is probably the
reason why it didn't hit any issue. When I added a larger text field to the
document, I hit this exception:

Exception in thread "Thread-0" java.lang.RuntimeException: java.lang.ArithmeticException: integer overflow
	at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:126)
	at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.ArithmeticException: integer overflow
	at java.base/java.lang.Math.addExact(Math.java:912)
	at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
	at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
	at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
	at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:170)
	at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
	at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1300)
	at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
	at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
	at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1553)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1838)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1478)
	at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:123)
	... 1 more

I suppose there are places that use 32-bit addressing, such as
`ByteBlockPool.byteOffset` above, and perhaps others, but Lucene counts
the entire RAM usage against the limit, and therefore builds
unnecessarily small segments.

Viliam

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

