My index contained only vectors plus a small string ID, which is probably
why it didn't hit any issue. When I added a larger text field to the
documents, I hit this exception:

Exception in thread "Thread-0" java.lang.RuntimeException: java.lang.ArithmeticException: integer overflow
    at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:126)
    at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.ArithmeticException: integer overflow
    at java.base/java.lang.Math.addExact(Math.java:912)
    at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
    at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
    at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:170)
    at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
    at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1300)
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1553)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1838)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1478)
    at com.dreamer.viliamexp.querybench.CreateCommand.lambda$runEx$1(CreateCommand.java:123)
    ... 1 more

I suppose there are places that use 32-bit addressing, such as the
`ByteBlockPool.byteOffset` above, and perhaps others. But Lucene counts
the entire RAM usage toward the limit, and therefore builds
unnecessarily small segments.
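
To illustrate the kind of overflow the trace points at, here is a minimal
standalone sketch. It does not use Lucene at all: the class name
`OffsetOverflowSketch`, the helper `blocksBeforeOverflow`, and the 32 KB
block size are assumptions made for illustration only. It just shows that an
`int` offset advanced with `Math.addExact` by a fixed block size fails once
the running total would reach 2 GB.

```java
class OffsetOverflowSketch {
    // Assumed block size for illustration (32 KB); not read from Lucene.
    static final int BLOCK_SIZE = 1 << 15;

    /**
     * Counts how many fixed-size blocks can be appended before an int
     * offset, advanced with Math.addExact, overflows.
     */
    static long blocksBeforeOverflow(int blockSize) {
        int offset = 0;
        long blocks = 0;
        try {
            while (true) {
                // Throws ArithmeticException("integer overflow") when the
                // sum no longer fits in a signed 32-bit int.
                offset = Math.addExact(offset, blockSize);
                blocks++;
            }
        } catch (ArithmeticException e) {
            return blocks;
        }
    }

    public static void main(String[] args) {
        long blocks = blocksBeforeOverflow(BLOCK_SIZE);
        System.out.println("blocks before overflow: " + blocks);
        System.out.println("bytes addressed: " + blocks * BLOCK_SIZE);
    }
}
```

Under these assumptions the ceiling depends only on the width of the offset,
not on how much heap is available or on what else is counted as RAM usage.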

Viliam
