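That's a textbook example of integer overflow: the start offset in the exception (2147483645) is just below the 32-bit signed maximum of 2147483647, and the end offset has wrapped around to a negative value. Perhaps Lucene is not designed to work with such large single files.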
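
To make the wrap-around concrete, here is a minimal sketch in plain Java (no Lucene involved). The class and method names are hypothetical, and the token length of 4 is only an assumption that happens to reproduce the reported numbers; I am not claiming this is the exact arithmetic inside the tokenizer.

```java
// Sketch: shows how adding to an int near Integer.MAX_VALUE wraps to a negative value.
final class OffsetOverflowDemo {

    // Hypothetical helper: plain int addition, which silently wraps on overflow.
    static int endOffset(int startOffset, int tokenLength) {
        return startOffset + tokenLength;
    }

    // Same computation, but Math.addExact throws ArithmeticException instead of wrapping.
    static int checkedEndOffset(int startOffset, int tokenLength) {
        return Math.addExact(startOffset, tokenLength);
    }

    public static void main(String[] args) {
        int start = 2147483645; // startOffset taken from the exception message
        int length = 4;         // assumed token length, for illustration only

        System.out.println(endOffset(start, length)); // prints -2147483647

        try {
            checkedEndOffset(start, length);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

The wrapped result matches the endOffset=-2147483647 in the report, which is consistent with character offsets being tracked as 32-bit ints while the input is larger than 2 GiB.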
On Fri, 28 Feb 2025, 10:50 Dawid Weiss, <dawid.we...@gmail.com> wrote:

> Split your large file into smaller fragments and index each fragment as a
> document.
>
> D.
>
> On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira <dan.l...@lispclub.com>
> wrote:
>
> > Hi. I have apache-lucene version 10.1.0:
> > ```
> > $ pacman -Qs apache-lucene
> > local/apache-lucene 10.1.0-1
> >     Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
> > ```
> >
> > I am trying to build a lucene index for a large file.
> > ```
> > $ ll
> > total 2,3G
> > -rw------- 1 ** ** 2,3G 2022-12-03 00:35 n-gram5_utf8.txt
> > ```
> >
> > Apache Lucene is blowing up with this large file. It does compute for a
> > while, but then it reaches a point where this happens, before it is
> > finished:
> > ```
> > $ java -cp /usr/share/java/apache-lucene/lucene-core-10.1.0.jar:/usr/share/java/apache-lucene/lucene-demo-10.1.0.jar:/usr/share/java/apache-lucene/lucene-analysis-common-10.1.0.jar org.apache.lucene.demo.IndexFiles -index . -docs n-gram5_utf8.txt
> > Indexing to directory '.'...
> > WARNING: A restricted method in java.lang.foreign.Linker has been called
> > WARNING: java.lang.foreign.Linker::downcallHandle has been called by the unnamed module
> > WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for this module
> >
> > fev. 27, 2025 3:38:14 DA TARDE org.apache.lucene.internal.vectorization.VectorizationProvider lookup
> > WARNING: Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.
> > adding n-gram5_utf8.txt
> > Exception in thread "main" java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=2147483645,endOffset=-2147483647
> >     at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:125)
> >     at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)
> >     at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:37)
> >     at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
> >     at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1218)
> >     at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
> >     at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
> >     at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
> >     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
> >     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
> >     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552)
> >     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
> >     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1477)
> >     at org.apache.lucene.demo.IndexFiles.indexDoc(IndexFiles.java:274)
> >     at org.apache.lucene.demo.IndexFiles.indexDocs(IndexFiles.java:225)
> >     at org.apache.lucene.demo.IndexFiles.main(IndexFiles.java:158)
> > ```
> >
> > How can this be fixed, and how can I build a lucene index for this
> > large file?
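
For the splitting suggested in the quoted reply, here is a rough sketch of one way to cut a line-oriented text file into fragments. It is plain Java with no Lucene dependency; `FragmentSplitter`, `split`, and the `linesPerFragment` parameter are hypothetical names, and splitting on a fixed line count is my assumption (it seems a natural fit for an n-gram list, but any boundary that keeps each fragment well under 2 GiB should do).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: groups the lines of a reader into fixed-size fragments and hands each to a sink.
final class FragmentSplitter {

    // Returns the number of fragments emitted. The last fragment may be shorter.
    static int split(BufferedReader reader, int linesPerFragment, Consumer<String> sink) {
        if (linesPerFragment <= 0) {
            throw new IllegalArgumentException("linesPerFragment must be positive");
        }
        StringBuilder fragment = new StringBuilder();
        int linesInFragment = 0;
        int fragments = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                fragment.append(line).append('\n');
                if (++linesInFragment == linesPerFragment) {
                    sink.accept(fragment.toString());
                    fragments++;
                    fragment.setLength(0); // reuse the buffer for the next fragment
                    linesInFragment = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (linesInFragment > 0) { // flush the trailing partial fragment
            sink.accept(fragment.toString());
            fragments++;
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc\nd\ne\n"));
        int count = split(reader, 2, out::add);
        System.out.println(count + " fragments"); // prints "3 fragments"
    }
}
```

In a real indexer the sink would presumably wrap each fragment in a Lucene `Document` (for example with a `TextField`) and pass it to `IndexWriter.addDocument`, so that offsets restart from zero for every fragment instead of accumulating across the whole 2,3G file.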