Hi,
The segment size and this buffer parameter are unrelated to each other.
Lucene builds smaller segments during indexing, but they are merged at a
later stage anyway, so producing larger segments from the beginning and
hitting limits like the one you see is not required for fast search.
Raising that limit therefore does not make sense. The 32-bit addressing
is a Java limitation, as array sizes are limited to 2^31 minus a few
elements. The reason for this is that IndexWriter needs to build its
initial structures on the Java heap and is therefore bound by 32-bit
array addressing.
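To illustrate the limit (this is not Lucene code, just a minimal sketch): the ArithmeticException in the trace below comes from Math.addExact, which throws as soon as a byte offset would exceed Integer.MAX_VALUE (2^31 - 1) instead of silently wrapping around. The offset and buffer-size values here are made up for the demonstration:

```java
// Minimal illustration of the 32-bit addressing limit via Math.addExact.
public class OverflowDemo {
    public static void main(String[] args) {
        int byteOffset = Integer.MAX_VALUE - 10; // hypothetical offset near the int limit
        try {
            // adding one more buffer's worth of bytes exceeds the int range
            int next = Math.addExact(byteOffset, 32 * 1024);
            System.out.println("next offset: " + next);
        } catch (ArithmeticException e) {
            // prints "hit the limit: integer overflow"
            System.out.println("hit the limit: " + e.getMessage());
        }
    }
}
```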
Normally an index consists of several segments with an approximately
logarithmic size distribution. This allows updates to be applied faster
and makes the more frequent merges of smaller segments cheaper. If your
index is read-only, consider calling IndexWriter#forceMerge(1) after
building it. But be aware: once you have done that, updating the index
may perform very badly, as the index will accumulate many deleted
documents that can't be merged away. So a force merge only makes sense
when the index is read-only afterwards.
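As a minimal sketch of the build-then-force-merge pattern (the directory path and analyzer choice are illustrative, assuming a recent Lucene version):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BuildReadOnlyIndex {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // ... add all documents here ...
            writer.forceMerge(1); // merge everything down to a single segment
            writer.commit();      // only do this if the index stays read-only afterwards
        }
    }
}
```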
Uwe
On 23.05.2025 at 14:59, Viliam Ďurina wrote:
My index contained only vectors, plus a small string ID. That is probably
why it didn't hit any issue. When I added a larger text field to the
document, I hit this exception:
Exception in thread "Thread-0" java.lang.RuntimeException: java.lang.ArithmeticException: integer overflow
    at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:126)
    at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.ArithmeticException: integer overflow
    at java.base/java.lang.Math.addExact(Math.java:912)
    at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
    at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
    at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:170)
    at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
    at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1300)
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1553)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1838)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1478)
    at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:123)
    ... 1 more
I suppose there are places that use 32-bit addressing, such as the
`ByteBlockPool.byteOffset` above, and perhaps others, but Lucene includes
the entire RAM usage in the limit, and therefore builds unnecessarily
small segments.
Viliam
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org