[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Laura Dietz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314939#comment-16314939
 ] 

Laura Dietz commented on LUCENE-8118:
-

+1

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -
>
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.2
> Environment: Debian/Stretch
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> Reporter: Laura Dietz
> Attachments: LUCENE-8118_test.patch
>
>
> Indexing a large collection of about 20 million paragraph-sized documents 
> results in an ArrayIndexOutOfBoundsException in 
> org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below). 
> The bug is possibly related to issues described in 
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
> and [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936] -- but I 
> am not using SOLR, I am directly using Lucene Core.
> The issue can be reproduced using code from 
> [GitHub trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
>
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
>
> Where paragraphCorpus.cbor is contained in this 
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>   at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>   at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>   at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>   at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>   at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>   at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
>   at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314869#comment-16314869
 ] 

Robert Muir commented on LUCENE-8118:
-

Dawid, it is not complicated in this case. It is *trivial* to fix.

To explain again:

* With *addDocument* you don't hit OOM and you don't need a huge heap. Just keep 
indexing documents and Lucene will flush to disk appropriately. 
* With *addDocumentS* it will try to add everything you pass atomically, as 
one "transaction".

There are a couple of problems here. First is the method's name: addDocuments is 
*not* the plural form of addDocument, it's something totally different 
altogether. It needs to be addDocumentsAtomic or addDocumentsBlock or 
something else, anything else. It's also missing bounds checks, which is why you 
see the AIOOBE; those need to be added.
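
To illustrate what a bounds check could buy here: one way a negative index such as -65536 can arise is an int byte offset wrapping around once more than 2 GiB has been buffered. The sketch below is not Lucene's code. The class name, the method names, and the 32 KiB block size (a shift of 15) are assumptions chosen only to make the arithmetic visible.

```java
// Hypothetical sketch; not Lucene's actual implementation.
final class OffsetOverflowSketch {
    // Assumed block geometry: 32 KiB blocks addressed through one int offset.
    static final int BLOCK_SHIFT = 15;
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;

    // Unchecked: the offset silently wraps to a negative value past 2 GiB.
    static int nextOffsetUnchecked(int offset) {
        return offset + BLOCK_SIZE;
    }

    // Checked: fail with a descriptive message instead of a negative index.
    static int nextOffsetChecked(int offset) {
        try {
            return Math.addExact(offset, BLOCK_SIZE);
        } catch (ArithmeticException e) {
            throw new IllegalStateException(
                "buffer grew past 2 GiB; pass fewer documents per call", e);
        }
    }

    public static void main(String[] args) {
        // Start offset of the last block that an int can still address.
        int lastBlock = Integer.MAX_VALUE - BLOCK_SIZE + 1;
        int wrapped = nextOffsetUnchecked(lastBlock); // wraps to Integer.MIN_VALUE
        System.out.println(wrapped >> BLOCK_SHIFT);   // prints -65536
    }
}
```

With the checked variant the caller would see a message that names the actual problem rather than an array index.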

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Laura Dietz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314853#comment-16314853
 ] 

Laura Dietz commented on LUCENE-8118:
-

Dawid, my computer has plenty of RAM, which is why I never see an OOM exception 
and always get the AIOOBE. 


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314698#comment-16314698
 ] 

Dawid Weiss commented on LUCENE-8118:
-

OOMs are complicated in general because once you hit one, there's a very real 
risk that you won't be able to recover anyway (even constructing a new 
exception message typically requires memory allocation and this just goes on 
and on in a vicious cycle). I remember thinking about it a lot in the early 
days of randomizedrunner, but without any constructive conclusions. I tried 
preallocating stuff in advance (not possible in all cases) and workarounds like 
keeping a memory buffer that is made reclaimable on OOM (so that there's some 
memory available before we hit the next one)... these are hacks more than 
solutions and they don't always work anyway (as in when you have background 
heap-competing threads...).

I like Java, but it starts to show its wrinkles. :(

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314548#comment-16314548
 ] 

Michael McCandless commented on LUCENE-8118:


bq. Well, I think a simple limit can work. For this API, e.g. a simple counter: 
throw an exception if the iterator has over 100k docs.

+1


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314545#comment-16314545
 ] 

Robert Muir commented on LUCENE-8118:
-

Well, I think a simple limit can work. For this API, e.g. a simple counter: 
throw an exception if the iterator has over 100k docs.
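
A minimal sketch of such a counter, written as a wrapper around the caller's Iterable rather than as a change inside IndexWriter. The class name BoundedIterable, the choice of IllegalArgumentException, and the message text are assumptions for illustration; nothing here is an existing Lucene API.

```java
import java.util.Iterator;

// Hypothetical wrapper; neither the class nor the limit is part of Lucene.
final class BoundedIterable<T> implements Iterable<T> {
    private final Iterable<T> delegate;
    private final int maxDocs;

    BoundedIterable(Iterable<T> delegate, int maxDocs) {
        this.delegate = delegate;
        this.maxDocs = maxDocs;
    }

    @Override
    public Iterator<T> iterator() {
        final Iterator<T> it = delegate.iterator();
        return new Iterator<T>() {
            private int count = 0;

            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public T next() {
                // Fail fast on the first document beyond the limit.
                if (count >= maxDocs && it.hasNext()) {
                    throw new IllegalArgumentException(
                        "more than " + maxDocs + " documents in one block; "
                        + "add them one at a time instead");
                }
                count++;
                return it.next();
            }
        };
    }
}
```

A caller could pass `new BoundedIterable<>(docs, 100_000)` wherever it would otherwise hand over the raw iterable, so an oversized block is rejected with a readable message before memory runs out.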


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314514#comment-16314514
 ] 

Michael McCandless commented on LUCENE-8118:


Note that committing only once at the end is entirely normal and often exactly 
the right choice.

It's hard to know how to fix this -- we could add a best-effort check so that 
if the RAM usage of that one in-memory segment (DWPT) exceeds the hard limit 
({{IWC.setRAMPerThreadHardLimitMB}}), we throw a better exception?
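
A rough sketch of what such a best-effort check might look like in isolation. The helper name, its parameters, and the exception type are assumptions; how the real bytes-used figure would be obtained from the DWPT is deliberately left out.

```java
// Hypothetical helper; the name and the exception choice are assumptions.
final class RamLimitCheck {
    // Throws if one in-memory segment has grown past the configured hard limit.
    static void ensureWithinHardLimit(long bytesUsed, int hardLimitMB) {
        // Multiply in long arithmetic so the limit itself cannot overflow an int.
        long limitBytes = hardLimitMB * 1024L * 1024L;
        if (bytesUsed > limitBytes) {
            throw new IllegalArgumentException(
                "in-memory segment uses " + bytesUsed + " bytes, above the hard limit of "
                + hardLimitMB + " MB; too many documents were passed in one call");
        }
    }
}
```

The check is best effort in the sense that it only runs when called, for example once per document, so usage can still exceed the limit between two checks.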

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313744#comment-16313744
 ] 

Robert Muir commented on LUCENE-8118:
-

The test had to work hard to hit AIOOBE instead of OOM. 

I think most users that do something like this will hit OOM, which is just as 
confusing and bad. It may technically be a different problem, but given the 
names of the methods and the APIs, I think it's easy for someone else to hit it too. 
Seems like add/updateDocuments need some sanity checks...

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313337#comment-16313337
 ] 

Robert Muir commented on LUCENE-8118:
-

yeah, but we still need to fix the case where someone passes too many documents 
for addDocuments to succeed: it needs to be better than AIOOBE.

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Laura Dietz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313323#comment-16313323
 ] 

Laura Dietz commented on LUCENE-8118:
-

I think my mistake was to abuse addDocuments(iterator).

I switched to addDocument(doc) with a commit every so often (see master branch)
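
The pattern described here, one document per call with an occasional commit, can be sketched roughly as follows. DocSink is a hypothetical stand-in for the two IndexWriter calls involved (addDocument and commit), introduced so the example runs without Lucene on the classpath; StreamingIndexer and commitEvery are likewise made-up names.

```java
import java.util.Iterator;

// Hypothetical stand-in for the IndexWriter calls used below.
interface DocSink<D> {
    void addDocument(D doc);
    void commit();
}

final class StreamingIndexer {
    // Feeds documents one at a time and commits after every commitEvery documents.
    static <D> long indexAll(Iterator<D> docs, DocSink<D> sink, int commitEvery) {
        if (commitEvery <= 0) {
            throw new IllegalArgumentException("commitEvery must be positive");
        }
        long added = 0;
        while (docs.hasNext()) {
            sink.addDocument(docs.next()); // one document per call, never the whole iterator
            added++;
            if (added % commitEvery == 0) {
                sink.commit();
            }
        }
        sink.commit(); // final commit covers any remainder
        return added;
    }
}
```

The periodic commit is optional; as noted elsewhere in this thread, committing only once at the end is also normal. The part that matters for this issue is that documents are handed over one at a time.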

[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313322#comment-16313322
 ] 

Robert Muir commented on LUCENE-8118:
-

Whatever we decide to do, we can be sure that an AIOOBE (ArrayIndexOutOfBoundsException) is not the right answer :)


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313318#comment-16313318
 ] 

Robert Muir commented on LUCENE-8118:
-

Well, I understand the bug, but I'm not sure what the fix is.

The indexing code implements Iterable etc. to pull in the docs, and makes one 
single call to addDocuments().

This is supposed to be an "atomic add" of multiple documents at once, which 
gives certain guarantees: it is needed for nested documents and similar 
features, so that document IDs are aligned in a particular way.

In your case, it's too much data: IndexWriter isn't going to be able to do 200M 
docs in one operation like this.
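
The thread does not spell out the mechanism behind the negative index. One reading that happens to match the reported value, offered only as a sketch: if a flat int offset into an in-memory byte buffer wraps past Integer.MAX_VALUE, and the block index is taken by shifting that offset right by 15 bits, the result is exactly -65536. The shift of 15 and every name below are assumptions for illustration, not Lucene code.

```java
final class BlockIndexOverflowDemo {
    // Assumed block size of 2^15 bytes. This value is an assumption made for
    // the illustration; it is not taken from the thread.
    static final int ASSUMED_BLOCK_SHIFT = 15;

    /** Maps a flat int offset to a block index by dropping the low bits. */
    static int blockIndex(int offset) {
        return offset >> ASSUMED_BLOCK_SHIFT;
    }

    /** Advances a flat offset with plain int arithmetic, which wraps silently. */
    static int advance(int offset, int bytes) {
        return offset + bytes;
    }
}
```

Under these assumptions, `blockIndex(advance(Integer.MAX_VALUE, 1))` evaluates to -65536, the index shown in the reported exception.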


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313299#comment-16313299
 ] 

Robert Muir commented on LUCENE-8118:
-

It is nothing like that; it is simply a bug.


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Laura Dietz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313292#comment-16313292
 ] 

Laura Dietz commented on LUCENE-8118:
-

Robert, that would be even better!

It is difficult to guess the right interval for issuing commits. I 
understand that some hand-tuning might be necessary to get the highest 
performance under given resource constraints. If the issue is a buffer that is 
filling up, it would be helpful to have some form of emergency auto-commit.
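
What an "emergency auto-commit" could look like, as a minimal standalone sketch: track the buffered bytes and flush before a document would push the total past a fixed budget. `BudgetedBuffer` and its methods are hypothetical names, not Lucene API. The sketch flushes only between documents, so it would not help inside one single atomic call that carries all the data.

```java
final class BudgetedBuffer {
    private final long maxBufferedBytes;
    private long bufferedBytes = 0;
    private int flushes = 0;

    BudgetedBuffer(long maxBufferedBytes) {
        if (maxBufferedBytes <= 0) {
            throw new IllegalArgumentException("maxBufferedBytes must be positive");
        }
        this.maxBufferedBytes = maxBufferedBytes;
    }

    /** Buffers a document of docBytes, flushing first if it would exceed the budget. */
    void add(long docBytes) {
        if (bufferedBytes > 0 && bufferedBytes + docBytes > maxBufferedBytes) {
            flush();
        }
        bufferedBytes += docBytes;
    }

    /** Stands in for writing the buffered data out; here it only resets the counter. */
    void flush() {
        bufferedBytes = 0;
        flushes++;
    }

    long bufferedBytes() {
        return bufferedBytes;
    }

    int flushes() {
        return flushes;
    }
}
```

The budget is the only tuning knob here, which sidesteps having to guess a document-count interval.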


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Diego Ceccarelli (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313284#comment-16313284
 ] 

Diego Ceccarelli commented on LUCENE-8118:
--

I agree, that was just a workaround for [~laura-dietz] :) 


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313271#comment-16313271
 ] 

Robert Muir commented on LUCENE-8118:
-

Issuing unnecessary commits is just masking the issue: you shouldn't see this 
exception.


[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Laura Dietz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313256#comment-16313256
 ] 

Laura Dietz commented on LUCENE-8118:
-

Yes, that works - Thanks, Diego!

An exception message along the lines of "Buffer full, call index.commit!" 
would have helped me.

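A descriptive failure of that kind can be sketched in plain Java: advance the offset with Math.addExact, which throws instead of wrapping, and translate the failure into a message that tells the caller what to do. `CheckedOffset` and the message wording are hypothetical and are not what Lucene does.

```java
final class CheckedOffset {
    /** Advances offset by bytes, failing loudly instead of wrapping to a negative value. */
    static int advance(int offset, int bytes) {
        try {
            // Math.addExact throws ArithmeticException on int overflow.
            return Math.addExact(offset, bytes);
        } catch (ArithmeticException e) {
            throw new IllegalStateException(
                "Buffer full: cannot address more than " + Integer.MAX_VALUE
                    + " bytes in one in-memory buffer; flush or commit before adding more",
                e);
        }
    }
}
```

The caller then sees an IllegalStateException that names the cause, rather than an ArrayIndexOutOfBoundsException from deep inside the indexing chain.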




[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2018-01-05 Thread Diego Ceccarelli (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312971#comment-16312971
 ] 

Diego Ceccarelli commented on LUCENE-8118:
--

Looking at your code, it seems that there is only one commit at the end, and 
your collection is big. Could you please try to commit every, let's say, 50k 
docs?
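
A minimal sketch of that pattern, assuming documents are added one at a time and a commit is issued after every fixed number of them. `DocSink` is a hypothetical stand-in for the two IndexWriter calls named in this thread (addDocument and commit), so the example runs without Lucene on the classpath; `PeriodicCommitIndexer` and `commitEvery` are likewise made up for illustration.

```java
// Hypothetical stand-in for the IndexWriter calls discussed in the thread.
interface DocSink<D> {
    void addDocument(D doc);
    void commit();
}

final class PeriodicCommitIndexer {
    /**
     * Adds documents one at a time and commits after every commitEvery
     * documents, plus once at the end for any remainder.
     * Returns the number of commits issued.
     */
    static <D> int indexAll(Iterable<D> docs, DocSink<D> sink, int commitEvery) {
        if (commitEvery <= 0) {
            throw new IllegalArgumentException("commitEvery must be positive");
        }
        int sinceCommit = 0;
        int commits = 0;
        for (D doc : docs) {
            sink.addDocument(doc);
            if (++sinceCommit == commitEvery) {
                sink.commit();
                commits++;
                sinceCommit = 0;
            }
        }
        if (sinceCommit > 0) {
            sink.commit();
            commits++;
        }
        return commits;
    }
}
```

With `commitEvery` set to 50000 this follows the suggestion above; a good interval likely depends on document size and available heap.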
