Re: apache-lucene blowing up with large file

Dawid Weiss Sat, 01 Mar 2025 08:38:25 -0800

The simple answer is: split your large text document into smaller
documents, then run the same command but point -docs at the folder that
contains those smaller fragments. Each file in that folder is then indexed
as its own document.
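
If you don't already have a splitting tool at hand, here is a minimal
sketch in plain Java (standard library only, no Lucene involved). It cuts
on line boundaries so no token is split in half. The class name
FileSplitter, the part-NNNNN.txt naming and the linesPerFragment parameter
are my own assumptions for illustration, not anything the demo requires.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene: splits a UTF-8 text file into
// fragments of at most linesPerFragment lines each.
class FileSplitter {

    /** Writes part-00000.txt, part-00001.txt, ... into outDir; returns the fragment count. */
    static int split(Path input, Path outDir, int linesPerFragment) throws IOException {
        if (linesPerFragment <= 0) {
            throw new IllegalArgumentException("linesPerFragment must be positive");
        }
        Files.createDirectories(outDir);
        int fragments = 0;
        int linesInCurrent = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null) {
                    // Open the next fragment lazily, so no empty file is created.
                    Path part = outDir.resolve(String.format("part-%05d.txt", fragments));
                    out = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                    fragments++;
                }
                out.write(line);
                out.newLine();
                linesInCurrent++;
                if (linesInCurrent == linesPerFragment) {
                    out.close();
                    out = null;
                    linesInCurrent = 0;
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return fragments;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.out.println("usage: FileSplitter <input> <outDir> <linesPerFragment>");
            return;
        }
        int n = split(Path.of(args[0]), Path.of(args[1]), Integer.parseInt(args[2]));
        System.out.println("wrote " + n + " fragments to " + args[1]);
    }
}
```

After that, the IndexFiles command stays the same except that -docs names
the output directory instead of n-gram5_utf8.txt.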


That said, I think you should take a look at using the Java API directly,
Daniel. You'll have a lot more control over how you index your document(s)
and how you then query them. You can even start from the source of
IndexFiles (the demo class).

> That's a textbook example of integer overflow. Perhaps Lucene is not
> designed to work with such large single files.

Correct. Token offsets and positions within a document are Java ints, so a
document this large runs past Integer.MAX_VALUE (2,147,483,647) and the
offsets wrap around to negative values. It's also very unusual to index
such a large file: how would you then retrieve it or highlight it?
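
The numbers in the exception are consistent with that. A small sketch in
plain Java arithmetic (this is not Lucene's tokenizer code, and the token
length of 4 is just an assumption that happens to reproduce the reported
pair of values):

```java
// Illustration only: plain 32-bit int arithmetic, not Lucene's actual code.
class OffsetOverflowDemo {

    // Adds a token length to a start offset using int arithmetic,
    // which silently wraps around on overflow.
    static int endOffset(int startOffset, int tokenLength) {
        return startOffset + tokenLength;
    }

    public static void main(String[] args) {
        int start = 2147483645;         // startOffset from the reported exception
        int end = endOffset(start, 4);  // assumed 4-character token
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        // prints startOffset=2147483645,endOffset=-2147483647
        System.out.println("startOffset=" + start + ",endOffset=" + end);

        // Math.addExact throws instead of wrapping, which makes the overflow visible.
        try {
            Math.addExact(start, 4);
        } catch (ArithmeticException e) {
            System.out.println("addExact refused the addition: " + e.getMessage());
        }
    }
}
```

The start offset is only two below Integer.MAX_VALUE, so any token longer
than two characters pushes the end offset past the int range.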

Dawid



On Sat, Mar 1, 2025 at 1:10 AM Daniel Cerqueira <dan.l...@lispclub.com>
wrote:

> > On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira <dan.l...@lispclub.com>
> > wrote:
> >
> >> Hi. I have apache-lucene version 10.1.0:
> >> ```
> >> $ pacman -Qs apache-lucene
> >> local/apache-lucene 10.1.0-1
> >>     Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
> >> ```
> >>
> >> I am trying to build a lucene index for a large file.
> >> ```
> >> $ ll
> >> total 2,3G
> >> -rw------- 1 ** ** 2,3G 2022-12-03 00:35 n-gram5_utf8.txt
> >> ```
> >>
> >> Apache Lucene is blowing up with this large file. It does compute for a
> >> while, but then it reaches a point where this happens, before it is
> >> finished:
> >> ```
> >> $ java -cp /usr/share/java/apache-lucene/lucene-core-10.1.0.jar:/usr/share/java/apache-lucene/lucene-demo-10.1.0.jar:/usr/share/java/apache-lucene/lucene-analysis-common-10.1.0.jar org.apache.lucene.demo.IndexFiles -index . -docs n-gram5_utf8.txt
> >> Indexing to directory '.'...
> >> WARNING: A restricted method in java.lang.foreign.Linker has been called
> >> WARNING: java.lang.foreign.Linker::downcallHandle has been called by the unnamed module
> >> WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for this module
> >>
> >> Feb 27, 2025 3:38:14 PM org.apache.lucene.internal.vectorization.VectorizationProvider lookup
> >> WARNING: Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.
> >> adding n-gram5_utf8.txt
> >> Exception in thread "main" java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=2147483645,endOffset=-2147483647
> >>         at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:125)
> >>         at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)
> >>         at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:37)
> >>         at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
> >>         at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1218)
> >>         at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
> >>         at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
> >>         at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
> >>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
> >>         at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
> >>         at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552)
> >>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
> >>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1477)
> >>         at org.apache.lucene.demo.IndexFiles.indexDoc(IndexFiles.java:274)
> >>         at org.apache.lucene.demo.IndexFiles.indexDocs(IndexFiles.java:225)
> >>         at org.apache.lucene.demo.IndexFiles.main(IndexFiles.java:158)
> >> ```
> >>
> >> How can this be fixed, and how can I build a lucene index for this
> >> large file?
>
>
> Dawid Weiss <dawid.we...@gmail.com> writes:
>
> > Split your large file into smaller fragments and index each fragment as a
> > document.
>
>
> I know how to split a text file. How do I index each fragment as a
> document?
>
