Hi Michael,

The index that failed when merging into one segment definitely contained
the word "positional" more than 1 billion times. I hope to be able to give a more precise number once re-indexing with the "work-around" has finished.

Of course, the "work-around" is simply to fix this properly: stop having that word so often in the index, and certainly stop indexing it with docs, freqs and postings.
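
For reference, the innermost frame of the stack trace is Math.toIntExact, which throws an ArithmeticException with the message "integer overflow" whenever a long value does not fit in an int. A minimal stand-alone sketch of that behaviour (plain JDK, no Lucene; the class name is made up and the second value is purely illustrative, not something measured from our index):

```java
class ToIntExactDemo {
    /** Returns true if the value fits in an int, false if Math.toIntExact rejects it. */
    static boolean fitsInInt(long value) {
        try {
            Math.toIntExact(value);
            return true;
        } catch (ArithmeticException e) {
            // This is the same exception type that appears in the merge thread's trace.
            return false;
        }
    }

    public static void main(String[] args) {
        // Figure taken from the statistics quoted further down in this thread.
        long previousReleaseCount = 1_288_526_281L;
        // Illustrative only: the smallest value that no longer fits in an int.
        long justTooBig = (long) Integer.MAX_VALUE + 1;

        System.out.println(previousReleaseCount + " fits: " + fitsInInt(previousReleaseCount));
        System.out.println(justTooBig + " fits: " + fitsInInt(justTooBig));
    }
}
```

This only shows where the exception type comes from; it does not say which value inside the postings writer actually went over the limit.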

Some background information:

The use case was to find the set of documents that were either "positional" or "non-positional". This was present in the first check-in of our code 18 years ago! Since then our data has grown a bit ;) The code was using Lucene 1.4.3 at that time. Users would search on this as what would now be a facet, `type:positional`. I have changed this to a field called 'positional', indexed with IndexOptions.DOCS only and searched as `positional:yes`, and I rewrite the previous query syntax behind the scenes so as not to break any user tools.
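
Roughly, the query rewrite looks like the sketch below. This is not our production code: the class name and the regular expression are made up for illustration, it is plain JDK code with no Lucene query parser involved, and it only handles the simple unquoted `type:positional` form.

```java
import java.util.regex.Pattern;

class LegacyQueryRewriter {
    // Match "type:positional" only as a whole clause, so that neither
    // "type:non-positional" nor a longer field name such as "subtype:positional"
    // is touched.
    private static final Pattern LEGACY_CLAUSE =
        Pattern.compile("(?<![\\w-])type:positional(?![\\w-])");

    /** Rewrites the legacy clause to the new field syntax; everything else is left as-is. */
    static String rewrite(String userQuery) {
        return LEGACY_CLAUSE.matcher(userQuery).replaceAll("positional:yes");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("type:positional AND foo"));
        System.out.println(rewrite("type:non-positional"));
    }
}
```

Doing the rewrite on the incoming query string means existing user tools can keep sending the old syntax unchanged.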

Regards,
Jerven



On 5/14/24 17:43, Michael McCandless wrote:
I think we should at least open an issue to try to improve the exception message?  We might catch the exception higher up (where we know the field name) and rethrow with the field name, maybe.  We can discuss options on the issue ...
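
To make the idea concrete, here is a rough stand-alone sketch of the catch-and-rethrow pattern. It is plain JDK code, not Lucene's: `writeCount` and `writeField` are hypothetical stand-ins for the low-level writer and for a higher-level caller that knows the field name, and the message wording is only an example.

```java
class FieldContextDemo {
    /** Stand-in for low-level code that has no idea which field it is writing. */
    static void writeCount(long count) {
        // Throws ArithmeticException if count does not fit in an int.
        Math.toIntExact(count);
    }

    /** Higher-level caller that knows the field name and adds it to the failure. */
    static void writeField(String fieldName, long count) {
        try {
            writeCount(count);
        } catch (ArithmeticException e) {
            // Keep the original exception as the cause so no detail is lost.
            throw new IllegalStateException(
                "failed to write postings for field \"" + fieldName + "\": " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            writeField("type", (long) Integer.MAX_VALUE + 1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether the real fix should wrap, rethrow the same exception type, or do something else entirely is exactly what the issue would need to decide.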

If you are not using custom term frequencies it's not clear to me how you could even have hit this exception.

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 7, 2024 at 4:10 PM Michael Sokolov <msoko...@gmail.com> wrote:

    This is definitely a confusing error condition. If we can add more
    information without creating an undue burden for the indexer it would
    be nice, but I think this will be very challenging here since the
    exception is thrown at a low level in the code where there might not
    be a lot of useful info (i.e. the field name) to provide. And I expect
    there are other places that make a similar assumption we would have to
    track down?

    On Tue, May 7, 2024 at 9:10 AM Jerven Tjalling Bolleman
    <jerven.bolleman@sib.swiss> wrote:
     >
     > Dear Michael,
     >
     > Looking deeper into this, I think we overflowed a term frequency field.
     > Looking at some statistics, in a previous release we had 1,288,526,281
     > of a certain field; this would be larger now. Each of these would have
     > had a limited set of values. But crucially, nearly all of them would have
     > had the term "positional" or "non-positional" added to the document.
     >
     > There is no good reason to do this today; we should just turn this into
     > a boolean field and update the UI. I will do this and report back.
     >
     > Do you think that a patch adding a try/catch for a more informative log
     > message would be appreciated by the community? E.g. mentioning the field
     > name in the exception?
     >
     > Regards,
     > Jerven
     >
     > On 5/7/24 14:52, Jerven Tjalling Bolleman wrote:
     > > Dear Michael,
     > >
     > > Thank you for your help.
     > >
     > > We don't use custom term frequencies (I just double-checked with a code
     > > search).
     > > We also always merge down to one segment (historical, but also we index
     > > once and then there are no changes for a week to a month, and then we
     > > reindex every document from scratch).
     > >
     > > Your response is very helpful already and I very much appreciate it, as
     > > it cuts down the search space significantly.
     > >
     > > Regards,
     > > Jerven
     > >
     > >
     > > On 5/7/24 14:03, Michael Sokolov wrote:
     > >> It seems as if the term frequency for some term exceeded the maximum.
     > >> This can happen if you supplied custom term frequencies, e.g. with
     > >> TermFrequencyAttribute:
     > >> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true
     > >> The behavior didn't change since 8.x, but it's possible that the
     > >> merging brought together some very "high frequency" terms that were
     > >> previously not in the same segment?
     > >>
     > >> On Tue, May 7, 2024 at 4:03 AM Jerven Tjalling Bolleman
     > >> <jerven.bolleman@sib.swiss> wrote:
     > >>>
     > >>> Dear Lucene community,
     > >>>
     > >>> This morning I found this exception in our logs. This was the first time
     > >>> we indexed this data with Lucene 9.10. Before, we were still on the
     > >>> Lucene 8.x branch. Between the last indexing with 8 and this one with
     > >>> 9.10 we have a bit more data, so it could be something else that went
     > >>> over a limit.
     > >>>
     > >>> Unfortunately, from this log message I am at a loss for what is going
     > >>> on, and what I could do to prevent this from happening. Does anyone have
     > >>> any ideas?
     > >>>
     > >>> Regards,
     > >>> Jerven Bolleman
     > >>>
     > >>>
     > >>> Exception in thread "Lucene Merge Thread #202"
     > >>> org.apache.lucene.index.MergePolicy$MergeException:
     > >>> java.lang.ArithmeticException: integer overflow
     > >>>     at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
     > >>>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
     > >>> Caused by: java.lang.ArithmeticException: integer overflow
     > >>>     at java.base/java.lang.Math.toIntExact(Math.java:1135)
     > >>>     at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
     > >>>     at org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
     > >>>     at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
     > >>>     at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
     > >>>     at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
     > >>>     at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
     > >>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
     > >>>     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
     > >>>     at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
     > >>>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
     > >>>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
     > >>>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
     > >>>     at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
     > >>>     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
     > >>>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
     > >>>
     > >>>
     > >>> ---------------------------------------------------------------------
     > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
     > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org