Re: apache-lucene blowing up with large file

Daniel Cerqueira Fri, 28 Feb 2025 16:10:55 -0800

> On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira <[email protected]>
> wrote:
>
>> Hi. I have apache-lucene version 10.1.0:
>> ```
>> $ pacman -Qs apache-lucene
>> local/apache-lucene 10.1.0-1
>>     Apache Lucene is a high-performance, full-featured text search engine
>> library written entirely in Java.
>> ```
>>
>> I am trying to build a lucene index for a large file.
>> ```
>> $ ll
>> total 2,3G
>> -rw------- 1 ** ** 2,3G 2022-12-03 00:35 n-gram5_utf8.txt
>> ```
>>
>> Apache Lucene is blowing up with this large file. It does compute for a
>> while, but then it reaches a point where this happens, before it is
>> finished:
>> ```
>> $ java -cp
>> /usr/share/java/apache-lucene/lucene-core-10.1.0.jar:/usr/share/java/apache-lucene/lucene-demo-10.1.0.jar:/usr/share/java/apache-lucene/lucene-analysis-common-10.1.0.jar
>> org.apache.lucene.demo.IndexFiles -index . -docs n-gram5_utf8.txt
>> Indexing to directory '.'...
>> WARNING: A restricted method in java.lang.foreign.Linker has been called
>> WARNING: java.lang.foreign.Linker::downcallHandle has been called by the
>> unnamed module
>> WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for
>> this module
>>
>> fev. 27, 2025 3:38:14 DA TARDE
>> org.apache.lucene.internal.vectorization.VectorizationProvider lookup
>> WARNING: Java vector incubator module is not readable. For optimal vector
>> performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.
>> adding n-gram5_utf8.txt
>> Exception in thread "main" java.lang.IllegalArgumentException: startOffset
>> must be non-negative, and endOffset must be >= startOffset; got
>> startOffset=2147483645,endOffset=-2147483647
>>         at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:125)
>>         at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)
>>         at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:37)
>>         at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
>>         at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1218)
>>         at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1196)
>>         at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:741)
>>         at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:618)
>>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:274)
>>         at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
>>         at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552)
>>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
>>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1477)
>>         at org.apache.lucene.demo.IndexFiles.indexDoc(IndexFiles.java:274)
>>         at org.apache.lucene.demo.IndexFiles.indexDocs(IndexFiles.java:225)
>>         at org.apache.lucene.demo.IndexFiles.main(IndexFiles.java:158)
>> ```
>>
>> How can this be fixed, and how can I build a lucene index for this
>> large file?

Dawid Weiss <[email protected]> writes:

> Split your large file into smaller fragments and index each fragment as a
> document.
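
For context on why splitting should help: the offsets in the exception look like 32-bit overflow. Lucene token offsets are Java `int` values, and the demo appears to feed the whole 2,3G file to the analyzer as one document, so the character position runs past `Integer.MAX_VALUE` (2147483647) and wraps negative. A minimal sketch of that arithmetic, in plain Java with no Lucene dependency; the class name, the helper `endOffset`, and the token length of 4 are assumptions made for illustration, not values taken from Lucene's source:

```java
class OffsetOverflow {
    // Hypothetical helper: computes an end offset with 32-bit int arithmetic,
    // the same width Lucene uses for token offsets.
    static int endOffset(int startOffset, int tokenLength) {
        return startOffset + tokenLength; // wraps silently past Integer.MAX_VALUE
    }

    public static void main(String[] args) {
        int start = 2147483645;        // startOffset reported in the exception
        int end = endOffset(start, 4); // token length 4 is an assumption
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("endOffset = " + end); // prints "endOffset = -2147483647"
    }
}
```

With fragments that each stay well under 2^31 characters, the offsets within a single document never reach that limit.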

I know how to split a text file. How do I index each fragment as a
document?
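
One possible approach, sketched under assumptions: write the fragments as separate files in one directory, then point the demo's `-docs` option at that directory instead of at the single file. As far as I can tell, `org.apache.lucene.demo.IndexFiles` walks a directory and adds each file it finds as its own document. The splitter below uses only the Java standard library; the class name `SplitIntoFragments`, the `part-NNNNNN.txt` naming, and the lines-per-fragment parameter are hypothetical choices for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SplitIntoFragments {
    // Copies `input` into files of at most `linesPerFragment` lines each
    // inside `outDir`, and returns the number of fragment files written.
    // Assumes the input is valid UTF-8; malformed bytes raise an IOException.
    static int split(Path input, Path outDir, int linesPerFragment) throws IOException {
        Files.createDirectories(outDir);
        int fragments = 0;
        int linesInCurrent = 0;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null) {
                    // Open the next fragment lazily, so no empty file is created.
                    Path part = outDir.resolve(String.format("part-%06d.txt", fragments));
                    out = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                    fragments++;
                }
                out.write(line);
                out.newLine();
                if (++linesInCurrent == linesPerFragment) {
                    out.close();
                    out = null;
                    linesInCurrent = 0;
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return fragments;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SplitIntoFragments.java <input> <outDir> <linesPerFragment>
        int n = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("wrote " + n + " fragments");
    }
}
```

After splitting, the same `java -cp ... org.apache.lucene.demo.IndexFiles` command as before should work with `-docs` set to the fragment directory. It is probably also worth giving `-index` a separate directory, so that the index files do not end up next to the documents being indexed.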

