[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719221#comment-16719221 ] Christian Ziech commented on LUCENE-8606:
Those two failing tests are now also fixed, but I'm not sure whether I did so in the proper way. Someone with more insight into what the SpanWeight should return from explain() when the simScorer field is null should have a look.
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: (was: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch)
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719183#comment-16719183 ] Christian Ziech commented on LUCENE-8606:
Attached a new patch that fixes all but 2 test failures:
{noformat}
[junit4] Tests with failures [seed: 9F8CCC24EB4194B4]:
[junit4]   - org.apache.lucene.search.TestComplexExplanations.test2
[junit4]   - org.apache.lucene.search.TestComplexExplanationsOfNonMatches.test2
{noformat}
Those two failures are both caused by an NPE in the LeafSimScorer, which in turn is caused by the SpanWeight trying to explain a result with a "null" scorer.

Also, I had to include a somewhat controversial change in the patch, which removes the assertion "assert scoreMode.needsScores()" from the score() method of the AssertingScorer. The problem is that the explain() method of the BooleanQuery invokes the score() function to fill in the value of the Explanation object, and if that BooleanQuery is explained in the context of a ConstantScoreQuery, this assertion fires. I first tried to compute the value of the Explanation from the detail explanations of the BooleanQuery, but that didn't quite add up due to double/float inaccuracies.
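In outline, the pattern that trips the assertion looks roughly like this (a simplified sketch, not the actual BooleanWeight source; subExplanations is an illustrative placeholder):
{code}
// Sketch: explain() obtains a Scorer and calls score() on it to fill in the
// Explanation value. When the Weight was built with a score mode for which
// needsScores() == false (e.g. under a ConstantScoreQuery), AssertingScorer's
// "assert scoreMode.needsScores()" fires on that score() call.
Scorer scorer = scorer(context);
if (scorer != null && scorer.iterator().advance(doc) == doc) {
  float value = scorer.score(); // <-- assertion fires here in the asserting wrapper
  return Explanation.match(value, "sum of:", subExplanations);
}
return Explanation.noMatch("no matching clauses");
{code}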
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: (was: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch)
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719068#comment-16719068 ] Christian Ziech commented on LUCENE-8606:
Getting the tests you mentioned to work is the easy part. The harder part is that explaining a BooleanWeight which was created with "needsScores == false" runs into assertions...
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719035#comment-16719035 ] Christian Ziech commented on LUCENE-8606:
I'm running them right now - somehow I remembered that there was a Jenkins job that would auto-verify patches attached to tickets. But most likely I remembered that wrong.
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719028#comment-16719028 ] Christian Ziech commented on LUCENE-8606:
Sure - that definitely makes the patch smaller. I attached [^0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch], which performs the change only for the ConstantScoreQuery and the CachingWrapperWeight. This adds a small amount of code duplication, though.
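For illustration, a delegating explain() along the lines discussed here might look like the following (a minimal sketch assuming a field holding the wrapped query's Weight; names are illustrative and this is not the attached patch):
{code}
// Sketch: delegate to the wrapped weight and attach its explanation as a
// detail, so the constant score no longer hides which clause matched.
// `innerWeight` and `score` are assumed fields, not taken from the patch.
@Override
public Explanation explain(LeafReaderContext context, int doc) throws IOException {
  final Explanation inner = innerWeight.explain(context, doc);
  if (inner.isMatch()) {
    return Explanation.match(score, getQuery().toString(), inner);
  } else {
    return Explanation.noMatch(getQuery().toString() + " doesn't match id " + doc, inner);
  }
}
{code}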
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719003#comment-16719003 ] Christian Ziech commented on LUCENE-8606:
[~jpountz] sure, I could also just override the CachingWrapperWeight of the LRUQueryCache and the anonymous inner class of the ConstantScoreQuery ... as you prefer.
[jira] [Comment Edited] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718830#comment-16718830 ] Christian Ziech edited comment on LUCENE-8606 at 12/12/18 11:56 AM:
Sure, I would be happy to supply a change, but I think I'd first like to agree on an approach. The main challenge, in my opinion, is that the ConstantScoreWeight is subclassed in a lot of places in- and outside of Lucene, and changing the constructor signature would be problematic. So I'd add another constructor that passes in the wrapped Weight. The behavior would be improved if the wrapped weight is passed in, and would remain unchanged if that weight is missing. A fix that changes all usages of the ConstantScoreWeight is possible too, but that would break compatibility (as I'd replace the Query parameter with a Weight parameter instead of just adding an alternative constructor) ...
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718830#comment-16718830 ] Christian Ziech commented on LUCENE-8606:
Sure, I would be happy to supply a change, but I think I'd first like to agree on an approach. The main challenge, in my opinion, is that the ConstantScoreWeight is subclassed in a lot of places in- and outside of Lucene, and changing the constructor signature would be problematic. So I'd add another constructor that passes in the wrapped Weight. The behavior would be improved if the wrapped weight is passed in, and would remain unchanged if that weight is missing. A fix that changes all usages of the ConstantScoreWeight is possible too, but that would break compatibility (as I'd replace the Query parameter with a Weight parameter instead of just adding an alternative constructor) ...
[jira] [Created] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
Christian Ziech created LUCENE-8606:
Summary: ConstantScoreQuery loses explain details of wrapped query
Key: LUCENE-8606
URL: https://issues.apache.org/jira/browse/LUCENE-8606
Project: Lucene - Core
Issue Type: Improvement
Reporter: Christian Ziech

Right now the ConstantScoreWeight used by the ConstantScoreQuery does not add the details of the wrapped query to the explanation.
{code}
if (exists) {
  return Explanation.match(score, getQuery().toString() + (score == 1f ? "" : "^" + score));
} else {
  return Explanation.noMatch(getQuery().toString() + " doesn't match id " + doc);
}
{code}
This is inconvenient, as it makes it hard to figure out which term actually matched when one puts, e.g., a BooleanQuery into the FILTER clause of another BooleanQuery.
[jira] [Created] (LUCENE-8219) LevenshteinAutomata should estimate the number of states and transitions
Christian Ziech created LUCENE-8219:
Summary: LevenshteinAutomata should estimate the number of states and transitions
Key: LUCENE-8219
URL: https://issues.apache.org/jira/browse/LUCENE-8219
Project: Lucene - Core
Issue Type: Improvement
Reporter: Christian Ziech

Currently the toAutomaton() method of the LevenshteinAutomata class uses the default constructor of Automaton, although it knows exactly how many states the automaton will have and can reasonably estimate how many transitions it will need as well. I suggest changing the lines
{code:language=java|firstline=154|linenumbers=true}
// the number of states is based on the length of the word and n
int numStates = description.size();

Automaton a = new Automaton();
int lastState;
{code}
to
{code:language=java|firstline=154|linenumbers=true}
// the number of states is based on the length of the word and n
final int numStates = description.size();
final int numTransitions = numStates * Math.min(1 + 2 * n, alphabet.length);
final int prefixStates = prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;

final Automaton a = new Automaton(numStates + prefixStates, numTransitions);
int lastState;
{code}
For my test data this cut the total amount of memory needed for int arrays by a factor of 4. The estimate of "1 + 2 * editDistance" should maybe be replaced by a value coming from the ParametricDescription used.
[jira] [Commented] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092849#comment-14092849 ] Christian Ziech commented on LUCENE-5875:
Exactly - when the PagedGrowableWriter in the NodeHash used 1<<27 bytes per page, things worked like a charm (with maxint as suffix length, doShareNonSingletonNodes set to true, and both of the min suffix counts set to 0). What numbers are you interested in? With doShareSuffix enabled the FST takes 3.1 GB of disk space. I quickly fetched the following numbers:
- arcCount: 561802889
- nodeCount: 291569846
- arcWithOutputCount: 201469018

While in theory the nodeCount should hence be lower than 2.1B, I think we also got an exception when enabling packing. But I'm not sure if we tried it in conjunction with doShareSuffix.
[jira] [Commented] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092932#comment-14092932 ] Christian Ziech commented on LUCENE-5875:
{quote}
Hmm, something else is wrong then ... or was this just an OOME? If not, can you reproduce the non-OOME when turning on packing despite node count being well below 2.1B?
{quote}
Sure - give me 1-2 days and I'll paste it here.
[jira] [Created] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
Christian Ziech created LUCENE-5875:
Summary: Default page/block sizes in the FST package can cause OOMs
Key: LUCENE-5875
URL: https://issues.apache.org/jira/browse/LUCENE-5875
Project: Lucene - Core
Issue Type: Improvement
Components: core/FSTs
Affects Versions: 4.9
Reporter: Christian Ziech
Priority: Minor

We are building some fairly big FSTs (the biggest one having about 500M terms with an average of 20 characters per term) and that works very well so far. The problem is just that we can use neither the doShareSuffix nor the doPackFST option of the builder, since both cause exceptions for us: one being an OOM and the other an IllegalArgumentException for a negative array size in ArrayUtil.

The thing here is that in theory we still have far more than enough memory available, but it seems that Java for some reason cannot allocate byte or long arrays of the size the NodeHash needs (maybe fragmentation?). Reducing the page-size constant in the NodeHash from 1<<30 to e.g. 1<<27 seems to mostly fix the issue. Could e.g. the Builder pass through its bytesPageBits to the NodeHash, or could we get a custom parameter for that?

The other problem we ran into was a NegativeArraySizeException when we tried to pack the FST. It seems that we overflowed to 0x80000000. Unfortunately I accidentally overwrote that exception, but I remember it was triggered by the GrowableWriter for the inCounts in line 728 of the FST. If it helps I can try to reproduce it.
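For context, the constant in question is the page size handed to the NodeHash's PagedGrowableWriter; a sketch of the suggested reduction (constructor signature as in the packed-ints API of that era; the concrete values are an assumption based on the numbers above):
{code}
// With a 1 << 30-entry page, a single page's backing long[] can reach
// multiple gigabytes of *contiguous* memory once the node hash grows,
// which a fragmented heap may be unable to satisfy even with free space
// left. 1 << 27-entry pages cap each allocation at roughly 1/8 of that.
PagedGrowableWriter table =
    new PagedGrowableWriter(16, 1 << 27, 8, PackedInts.COMPACT);
{code}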
[jira] [Commented] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089366#comment-14089366 ] Christian Ziech commented on LUCENE-5875:
Oh, there is another OOM we get. At the time the exception was thrown we had been indexing for 5-6 hours and had already closed the IndexWriter; at that point we only wanted to store the special terms we gathered during indexing into a custom FST. When the exception was thrown, effectively only one thread was active in the VM, and the last GC attempt printed the following:

Eden: 0B(4021M)->0B(4021M) Survivors: 75M->75M Heap: 9615M(30720M)->9615M(30720M)

Those values are also pretty much in line with the numbers we get from the runtime if we add custom debug statements.

java.lang.OutOfMemoryError: Java heap space
  at org.apache.lucene.util.packed.Packed64.<init>(Packed64.java:73)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1034)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1001)
  at org.apache.lucene.util.packed.GrowableWriter.<init>(GrowableWriter.java:46)
  at org.apache.lucene.util.packed.GrowableWriter.resize(GrowableWriter.java:98)
  at org.apache.lucene.util.fst.FST.addNode(FST.java:845)
  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:200)
  at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
  at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$FSTWriter.put(AtomicFSTBuilder.java:358)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$WriteTask.run(AtomicFSTBuilder.java:156)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
[jira] [Comment Edited] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089366#comment-14089366 ] Christian Ziech edited comment on LUCENE-5875 at 8/7/14 3:54 PM:
Oh, there is another OOM we get. At the time the exception was thrown we had been indexing for 5-6 hours and had already closed the IndexWriter; at that point we only wanted to store the special terms we gathered during indexing into a custom FST. When the exception was thrown, effectively only one thread was active in the VM, and the last GC attempt printed the following:

Eden: 0B(4021M)->0B(4021M) Survivors: 75M->75M Heap: 9615M(30720M)->9615M(30720M)

Those values are also pretty much in line with the numbers we get from the runtime if we add custom debug statements.
{code}
java.lang.OutOfMemoryError: Java heap space
  at org.apache.lucene.util.packed.Packed64.<init>(Packed64.java:73)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1034)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1001)
  at org.apache.lucene.util.packed.GrowableWriter.<init>(GrowableWriter.java:46)
  at org.apache.lucene.util.packed.GrowableWriter.resize(GrowableWriter.java:98)
  at org.apache.lucene.util.fst.FST.addNode(FST.java:845)
  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:200)
  at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
  at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$FSTWriter.put(AtomicFSTBuilder.java:358)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$WriteTask.run(AtomicFSTBuilder.java:156)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
{code}
[jira] [Updated] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-5670:
Attachment: skipOutput_lucene48.patch

Attaching a patch relative to the lucene 4.8 branch.
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998560#comment-13998560 ] Christian Ziech commented on LUCENE-5584:
The additional ctor would be a solution as well, yes. We could then keep the FSTs in some cache and use one per thread.

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
> Key: LUCENE-5584
> URL: https://issues.apache.org/jira/browse/LUCENE-5584
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 4.7.1
> Reporter: Christian Ziech
> Attachments: fst-itersect-benchmark.tgz
>
> The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc, however, is not reused. This can be especially important when traversing large portions of an FST and using the ByteSequenceOutputs and CharSequenceOutputs: those classes create a new byte[] or char[] for every node read (which has an output). In our use case we intersect a Lucene Automaton with an FST<BytesRef>, much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and since the Automaton and the FST are both rather large, tens or even hundreds of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility):
> {code}
> /** Decode an output value previously written with {@link
>  *  #write(Object, DataOutput)} reusing the object passed in if possible */
> public abstract T read(DataInput in, T reuse) throws IOException;
>
> /** Decode an output value previously written with {@link
>  *  #writeFinalOutput(Object, DataOutput)}. By default this
>  *  just calls {@link #read(DataInput)}. This tries to reuse the object
>  *  passed in if possible */
> public T readFinalOutput(DataInput in, T reuse) throws IOException {
>   return read(in, reuse);
> }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method, passing in the output of the reused Arc. For most inputs they could even just invoke the original read(in) method. If you should decide to make that change I'd be happy to supply a patch and/or tests for the feature.
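To make the proposal concrete, a reuse-aware read() for a byte-sequence output could look roughly like this (a sketch assuming the ByteSequenceOutputs wire format of a vInt length followed by the raw bytes; not code from the ticket):
{code}
public BytesRef read(DataInput in, BytesRef reuse) throws IOException {
  final int len = in.readVInt();
  if (reuse == null || reuse.bytes.length < len) {
    reuse = new BytesRef(len); // allocate only when the scratch buffer is too small
  }
  in.readBytes(reuse.bytes, 0, len);
  reuse.offset = 0;
  reuse.length = len;
  return reuse;
}
{code}
Inside the FST, readNextRealArc() would then pass the arc's previous output back in, e.g. arc.output = outputs.read(in, arc.output), so a hot traversal loop allocates only when an output outgrows the scratch buffer.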
[jira] [Comment Edited] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997744#comment-13997744 ] Christian Ziech edited comment on LUCENE-5670 at 5/14/14 4:53 PM:
Oh right! I only checked the 4.7 branch, and there the DataInput didn't have the skipBytes() method yet. But now I saw that both trunk and the 4.8 branch already have skipBytes(long). So yes, of course, in that case we can drop it from the patch. If we can get consensus that the rest of the patch is worth doing, I could implement it against 4.8 and attach it here.

Edit: The ticket that added skipBytes to the DataInput was LUCENE-5583.
[jira] [Commented] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997633#comment-13997633 ] Christian Ziech commented on LUCENE-5670:
No, actually only some of the subclasses of DataInput had a skipBytes() implementation - e.g. the BytesReader intermediate abstract class added it to the interface, and the ByteArrayDataInput had it before as well. Maybe one should scan all the other implementations for a similar method that is just named differently, or for ones that could implement it easily (e.g. IndexInput could implement the skip method as a combination of seek() and getFilePointer()).
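For the IndexInput case mentioned above, the seek-based variant would be a one-liner (an illustrative sketch, not the committed implementation):
{code}
// Sketch: skip forward by seeking relative to the current file pointer.
// IndexInput already exposes getFilePointer() and seek(long), so no bytes
// need to be read and discarded.
@Override
public void skipBytes(long numBytes) throws IOException {
  if (numBytes < 0) {
    throw new IllegalArgumentException("numBytes must be >= 0, got " + numBytes);
  }
  seek(getFilePointer() + numBytes);
}
{code}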
[jira] [Created] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
Christian Ziech created LUCENE-5670:
Summary: org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
Key: LUCENE-5670
URL: https://issues.apache.org/jira/browse/LUCENE-5670
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.7
Reporter: Christian Ziech
Priority: Minor

Currently the FST uses the read(DataInput) method from the Outputs class to skip over outputs it actually is not interested in. For most use cases this just creates additional objects that are immediately destroyed again. When traversing an FST with non-trivial data, however, this can easily add up to many excess objects that nobody ever reads.
[jira] [Commented] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997744#comment-13997744 ] Christian Ziech commented on LUCENE-5670:
Oh right! I only checked the 4.7 branch, and there the DataInput didn't have the skipBytes() method yet. But now I saw that both trunk and the 4.8 branch already have skipBytes(long). So yes, of course, in that case we can drop it from the patch. If we can get consensus that the rest of the patch is worth doing, I could implement it against 4.8 and attach it here.
[jira] [Updated] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-5670:
Attachment: LUCENE-5670.patch

Attached an (untested) patch where a skipOutput method is added to the Outputs, which doesn't create excess objects; the default implementation keeps the current behavior by invoking the read() method. Also, a skipBytes(int) method was added to the DataInput, which defaults to reading the data as before. Several implementations of DataInput already had a skipBytes() method and now effectively implement it.
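From that description, the default presumably looks something like the following (names inferred from the comment above, not copied from the patch):
{code}
/** Skip over an output previously written with write(Object, DataOutput).
 *  Sketch of the described default: read and discard the value, preserving
 *  the old behavior; subclasses can override this to advance the input
 *  without materializing an output object at all. */
public void skipOutput(DataInput in) throws IOException {
  read(in); // result intentionally discarded
}
{code}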
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996249#comment-13996249 ] Christian Ziech commented on LUCENE-5584:
{quote}
I'm also confused on why a custom Outputs impl that secretly reuses isn't sufficient here.
{quote}
The problem is that Outputs.add() is just one of the locations that would create a new output-value instance. We do the reusing you suggested for the target of the add operation in our implementation, and it works fine, but it only fixes 50% of the allocation sites. The other half of the sites is in the Outputs.read() method; fixing that site was the reason for this ticket. Basically it would be fine for us if every call to Outputs.read() returned the very same output-value instance, since we only use it to add to the result we build internally (and throw it away thereafter). Since multiple threads use the same FST in parallel, however, this can only be achieved by a hack that uses a ThreadLocal to prevent threading issues with multiple threads using the same output-value instance. This also has the stench of adding a (hidden) state to a supposedly stateless class.
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994934#comment-13994934 ] Christian Ziech commented on LUCENE-5584: - Optimizing our code for intersecting an automaton with an FST (inspired by org.apache.lucene.search.suggest.analyzing.FSTUtil#intersectPrefixPaths) I came across the following locations that create objects that actually could do without: - the scratchArc is created for every node in the automaton - for every state in the Automaton an iterator is created implicitly when iterating over the Transitions of the state - outputs.add() creates a new outputs value object for every state of the automaton if the corresponding FST state had an output - for every transition visited a new IntsRef instance is created - for every FST node read a new outputs value object is created All except the last allocation location was fixed easily: - we keep the scratch arcs in a Stack and hence only create one per level of the automaton (about 10-15 levels for us) - we iterate over the states using an int index in the transitions array - we replaced outputs add by our own method that just appends the outputs of the FST Arc to a single outputs value per intersect call and then upon exiting the recursion just removes it again - same goes for the input IntsRef - we have one instance that is just modified as we traverse the automaton/FST For the last allocation location we now have gone with a special Outputs implementation that uses a rather ugly construct to always return the very same outputs instance for the iterate case per Thread. Thinking about the problem again I came to think of another (easier) solution to that problem. If the outputs of the FST wouldn't actually be a field of the FST itself but if they would be under control of the caller of the FST read*Arc methods just like the BytesReader is, we wouldn't have the problem (maybe instead of the BytesReader). That way we could just create a new Outputs instance for each of our intersection runs and wouldn't need to resort that construct which attaches a state to something that is not meant to have a state. Allow FST read method to also recycle the output value when traversing FST -- Key: LUCENE-5584 URL: https://issues.apache.org/jira/browse/LUCENE-5584 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Affects Versions: 4.7.1 Reporter: Christian Ziech Attachments: fst-itersect-benchmark.tgz The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc however is not reused. This can especially be important when traversing large portions of a FST and using the ByteSequenceOutputs and CharSequenceOutputs. Those classes create a new byte[] or char[] for every node read (which has an output). In our use case we intersect a lucene Automaton with a FSTBytesRef much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and since the Automaton and the FST are both rather large tens or even hundreds of thousands of temporary byte array objects are created. 
[jira] [Updated] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Ziech updated LUCENE-5584:
Attachment: fst-itersect-benchmark.tgz

Sorry it took me quite some time to assemble a benchmark that I can attach here. I basically copied the FST and other required classes into the project and modified the FST so that it supports reusing the value objects of Arcs. As said: the overhead of the temporary Long instances is quite manageable (about 3-5% of the overall execution time in the benchmark) - but there is still some. The main advantage we get from putting the values into the FST is the total space requirement. Right now I don't see an easy way to implement the prefix compression ourselves, outside of the FST - but I need to think about that more carefully.

To run the benchmark yourself, first generate an FST using the RandomFstBuilder class; afterwards you can use the BenchmarkMutableFst class to execute the benchmark.

PS: I also included a variant that uses the intersect implementation of the analyzers package, but I don't think those numbers are a fair comparison, since the analyzers variant takes a shortcut we afaik cannot take with our automaton.
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13967752#comment-13967752 ]

Christian Ziech commented on LUCENE-5584:

{quote}
If this is the case, then I am not sure you are using the correct datastructure: it seems to me that a byte sequence output is not appropriate. Since you do not care about the intermediate outputs, but have a complicated intersection with the FST, why not use a numeric output, pointing to the result data somewhere else?
{quote}

That is what we do right now. This however has the downside that we lose the FST's prefix compression of the values, which is significant in our case: the single FST with the values attached was roughly 1.2 GB, and now, with the referenced byte arrays (we load them into a DirectByteBuffer), we spend about 2.5 GB on the values alone. Of course we could try to implement the same prefix compression as the FST does on our own and fill a byte array while traversing the FST, but that feels like copying something that is already almost there. If we could just get the extension points I mentioned into Lucene, without actually changing the behavior of (most or any of) Lucene's code, that would be a huge help.
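For context, a minimal sketch of this offset-based workaround, assuming Lucene 4.x FST APIs; the method name and the external ByteBuffer value store are illustrative only:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.SortedMap;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// The FST stores only a numeric offset; the real values live in an external
// buffer (e.g. a DirectByteBuffer). This avoids per-arc byte[] allocation on
// reads but gives up the FST's prefix compression of the values.
static FST<Long> buildOffsetFst(SortedMap<BytesRef, byte[]> sortedEntries,
                                ByteBuffer valueStore) throws IOException {
  PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
  Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
  IntsRef scratch = new IntsRef();
  valueStore.put((byte) 0);  // pad: PositiveIntOutputs treats 0 as "no output"
  for (Map.Entry<BytesRef, byte[]> e : sortedEntries.entrySet()) {
    long offset = valueStore.position();       // the only thing the FST keeps
    builder.add(Util.toIntsRef(e.getKey(), scratch), offset);
    valueStore.put(e.getValue());              // value bytes live outside the FST
  }
  return builder.finish();
}
{code}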
[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13967752#comment-13967752 ]

Christian Ziech edited comment on LUCENE-5584 at 4/13/14 7:04 AM:

{quote}
If this is the case, then I am not sure you are using the correct datastructure: it seems to me that a byte sequence output is not appropriate. Since you do not care about the intermediate outputs, but have a complicated intersection with the FST, why not use a numeric output, pointing to the result data somewhere else?
{quote}

That is what we do right now. This however has the downside that we lose the FST's prefix compression of the values, which is significant in our case: the single FST with the values attached was roughly 1.2 GB, and now, with the referenced byte arrays (we load them into a DirectByteBuffer), we spend about 2.5 GB on the values alone. Of course we could try to implement the same prefix compression as the FST does on our own and fill a byte array while traversing the FST, but that feels like copying something that is already almost there. If we could just get the extension points I mentioned into Lucene, without actually changing the behavior of (most or any of) Lucene's code, that would be a huge help.

Edit: Also, with numeric outputs we still suffer from quite a few unwanted Long references that are created temporarily by the VM, just as the byte arrays were before. This problem is far less severe and actually manageable, though.
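The allocations the edit refers to come from plain Java autoboxing, since PositiveIntOutputs works on boxed Longs; a toy illustration (not Lucene code):
{code}
// Values outside the JVM's small autobox cache get a fresh Long object on
// every box operation -- one short-lived object per arc with an output.
Long a = 1000000L;
Long b = 2000000L;
Long sum = a + b;  // unbox, add, re-box: allocates a new Long
{code}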
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13966443#comment-13966443 ]

Christian Ziech commented on LUCENE-5584:

Modifying the existing ByteSequenceOutputs and CharSequenceOutputs to actually modify the output passed in upon read turns out to be rather complex, since a lot of Lucene's indexing code actually relies on the immutability of arc outputs. Would it be OK for you guys to have a new mutable ByteSequenceOutputs that could then be used for e.g. the AnalyzingSuggester? The patch would then mostly add the required extension points and only use them in the special cases that would benefit from them.
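As a sketch of what such a mutable variant's read overload could look like under the ticket's proposed read(DataInput, reuse) API (this is the proposal, not committed Lucene code; the NO_OUTPUT subtlety raised later in this thread is sidestepped by still returning the singleton for empty outputs):
{code}
public BytesRef read(DataInput in, BytesRef reuse) throws IOException {
  final int len = in.readVInt();
  if (len == 0) {
    // keep returning the NO_OUTPUT singleton so identity checks still work
    return ByteSequenceOutputs.getSingleton().getNoOutput();
  }
  reuse.grow(len);                 // keep the existing byte[] when large enough
  in.readBytes(reuse.bytes, 0, len);
  reuse.offset = 0;
  reuse.length = len;
  return reuse;                    // no new byte[] per arc
}
{code}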
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13965215#comment-13965215 ]

Christian Ziech commented on LUCENE-5584:

Thanks for the very quick and helpful replies. It seems that I owe you some more hard, concrete information on our use case, what exactly we do, and our environment.

About the environment - the tests were run with
{noformat}
java version 1.7.0_45
OpenJDK Runtime Environment (rhel-2.4.3.4.el6_5-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
{noformat}
on CentOS 6.5. Our VM options don't enable the TLAB right now, but I'm definitely considering using it for other reasons. Currently we are running with the following (GC-relevant) arguments: -Xmx6g -XX:MaxNewSize=700m -XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=35g.

I'm not so much worried about the get performance, although that could be improved as well. We are using Lucene's LevenshteinAutomata class to generate a couple of Levenshtein automata with edit distance 1 or 2 (one for each search term), build the union of them, and intersect that with our FST using a modified version of org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() which uses a callback method to push every matched entry instead of returning the whole list of paths (also for efficiency reasons: we don't actually need the byte arrays but want to parse them into a value object, hence reusing the output byte array is OK for us). A sketch of that setup follows below.

Our FST has about 500M entries, and each entry has a value of approx. 10-20 bytes. For a random query with 4 terms (and hence a union of 4 Levenshtein automata) that produces ~2M visited nodes with output (hence 2M temporary byte[] created) and a total size of ~7.5 MB for the temporary byte arrays (plus the per-instance overhead). In that experiment I matched about 10k terms in the FST. Those numbers already take into account that we use our own add implementation that writes into a single reused BytesRef instance when adding outputs. The overall impact on the GC and also on the execution speed of the method was rather significant in total - I can try to dig up numbers for that, but they would be rather application-specific.

Does this help and answer all the questions so far?

Btw: Experimenting a little with the change I noticed that things may be slightly more complicated, since the output of a node is often overwritten with NO_OUTPUT from the Outputs - so that method would need to recycle the current output as well if possible, which may have interesting side effects - but hopefully that should be manageable.
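To make the query-side setup concrete, roughly how such a union of per-term Levenshtein automata can be assembled with the Lucene 4.x automaton API (the per-term edit-distance policy is an assumption, and the modified intersectPrefixPaths itself is not shown):
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.BasicOperations;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

static Automaton buildFuzzyUnion(List<String> searchTerms) {
  List<Automaton> perTerm = new ArrayList<Automaton>();
  for (String term : searchTerms) {
    int maxEdits = term.length() <= 4 ? 1 : 2;  // hypothetical policy, not from the ticket
    // true = include transpositions as single edits
    perTerm.add(new LevenshteinAutomata(term, true).toAutomaton(maxEdits));
  }
  Automaton union = BasicOperations.union(perTerm);
  BasicOperations.determinize(union);  // the intersection walk expects a DFA
  return union;
}
{code}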
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13965316#comment-13965316 ]

Christian Ziech commented on LUCENE-5584:

Trying to assemble the patch I came across the FST.Arc.copyFrom(Arc) method, which unfortunately seems to implicitly assume that the output of a node is immutable (which it would no longer be). Is this immutability intended? If not, I think the copyFrom() method would need to be moved into the FST class so that it can use the Outputs of the FST to clone the output of the copied arc if it is mutable ... however, that would increase the size of the patch and possibly impact other users too ...
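A hypothetical sketch of that concern: if output values can be mutated in place, a copied Arc must not alias them. Outputs.copy() does not exist in Lucene and stands in for whatever deep-copy hook would have to be added; the method is imagined as living inside FST, where the outputs field is available:
{code}
// Today Arc.copyFrom() simply aliases the output references, which is fine
// for immutable outputs but wrong once they are recycled in place.
private Arc<T> copyArc(Arc<T> to, Arc<T> from) {
  to.copyFrom(from);                               // copies flags, label, target, ...
  to.output = outputs.copy(from.output);           // hypothetical deep copy
  to.nextFinalOutput = outputs.copy(from.nextFinalOutput);
  return to;
}
{code}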
[jira] [Created] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
Christian Ziech created LUCENE-5584:

Summary: Allow FST read method to also recycle the output value when traversing FST
Key: LUCENE-5584
URL: https://issues.apache.org/jira/browse/LUCENE-5584
Project: Lucene - Core
Issue Type: Improvement
Components: core/FSTs
Affects Versions: 4.7.1
Reporter: Christian Ziech

The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc, however, is not reused. This can be especially important when traversing large portions of an FST with the ByteSequenceOutputs and CharSequenceOutputs: those classes create a new byte[] or char[] for every node read (which has an output).

In our use case we intersect a Lucene Automaton with an FST<BytesRef>, much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and since the Automaton and the FST are both rather large, tens or even hundreds of thousands of temporary byte array objects are created.

One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility):
{code}
/** Decode an output value previously written with
 *  {@link #write(Object, DataOutput)}, reusing the object
 *  passed in if possible. */
public abstract T read(DataInput in, T reuse) throws IOException;

/** Decode an output value previously written with
 *  {@link #writeFinalOutput(Object, DataOutput)}. By default this
 *  just calls {@link #read(DataInput)}. This tries to reuse the
 *  object passed in if possible. */
public T readFinalOutput(DataInput in, T reuse) throws IOException {
  return read(in, reuse);
}
{code}
The new methods could then be used in the FST in the readNextRealArc() method, passing in the output of the reused Arc. For most inputs they could even just invoke the original read(in) method. If you should decide to make that change I'd be happy to supply a patch and/or tests for the feature.
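To make the proposal concrete, roughly where the new overload would plug in; this is a sketch of a possible change to FST.readNextRealArc(), not the shipped code:
{code}
// Shipped code allocates via arc.output = outputs.read(in); the proposal
// recycles the value the reused Arc already carries.
if (arc.flag(BIT_ARC_HAS_OUTPUT)) {
  arc.output = outputs.read(in, arc.output);  // reuse the Arc's previous value
} else {
  arc.output = outputs.getNoOutput();
}
{code}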
[jira] [Commented] (LUCENE-4930) Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention
[ https://issues.apache.org/jira/browse/LUCENE-4930?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13630329#comment-13630329 ]

Christian Ziech commented on LUCENE-4930:

I have checked the Java we are using and it is the latest Java 6 OpenJDK available for Ubuntu 12.04 LTS (6b24-1.11.5-0ubuntu1~12.04.1). There is a newer one available for 12.10 (6b27-1.12.3-0ubuntu1~12.10.1), but the issue is not fixed in that version either ... so there is no way around that code without upgrading to Java 7 ...

The code snippet from the reference queue looks as follows:
{noformat}
/**
 * Polls this queue to see if a reference object is available. If one is
 * available without further delay then it is removed from the queue and
 * returned. Otherwise this method immediately returns <tt>null</tt>.
 *
 * @return A reference object, if one was immediately available,
 *         otherwise <code>null</code>
 */
public Reference<? extends T> poll() {
    synchronized (lock) {
        return reallyPoll();
    }
}
{noformat}
So it seems that Uwe is right about our Java version not doing the double-checking here before actually entering the synchronized block.

However, I'm not really sure I understand the reason for Lucene to actually use a WeakKeyHashMap here: I may be wrong, but wouldn't that reap actually only happen when the interface class itself is unloaded? That should be an extremely rare thing, shouldn't it? If I understand the purpose of that code correctly, it is meant to prevent wasting memory in cases where the user does incremental indexing from time to time; the attribute source would otherwise prevent the interface class and implementation class from being garbage collected in the meantime. But is that case actually really worth the effort (I don't know how big the memory footprint of an Attribute implementation _class_ usually is)? I mean, that would only affect the static fields here (and in plain Lucene I could not find many of those) ...
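For comparison, the double-checked variant in later JDKs (paraphrased from OpenJDK 7, not quoted verbatim) avoids taking the lock at all when the queue is empty:
{code}
public Reference<? extends T> poll() {
    if (head == null)          // cheap unsynchronized fast path
        return null;
    synchronized (lock) {
        return reallyPoll();
    }
}
{code}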
Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention

Key: LUCENE-4930
URL: https://issues.apache.org/jira/browse/LUCENE-4930
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 4.2
Environment: Dell blade system with 16 cores
Reporter: Karl Wright
Attachments: thread_dump.txt

Our project is not optimally using full processing power under indexing load on Lucene 4.2.0. The reason is the AttributeSource.addAttribute() method, which goes through a WeakHashMap synchronizer, which is apparently single-threaded for a significant amount of time. Have a look at the following trace:
{noformat}
pool-1-thread-28 prio=10 tid=0x7f47fc104800 nid=0x672b waiting for monitor entry [0x7f47d19ed000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:98)
	- waiting to lock 0x0005c5cd9988 (a java.lang.ref.ReferenceQueue$Lock)
	at org.apache.lucene.util.WeakIdentityMap.reap(WeakIdentityMap.java:189)
	at org.apache.lucene.util.WeakIdentityMap.get(WeakIdentityMap.java:82)
	at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:74)
	at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:65)
	at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:271)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:107)
	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1148)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1129)
	…
{noformat}
We’ve had to make significant changes to the way we were indexing in order to not hit this issue as much, such as indexing using TokenStreams which we reuse, when it would have been more convenient to index with just tokens. (The reason is that Lucene internally creates TokenStream
[jira] [Commented] (LUCENE-4930) Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention
[ https://issues.apache.org/jira/browse/LUCENE-4930?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13630542#comment-13630542 ]

Christian Ziech commented on LUCENE-4930:

{quote}
The issue is not class unloading in your own application while it is running. The VM will never do this. It will only unload classes when the ClassLoader is released. This happens e.g. when you redeploy your web application in your Jetty or Tomcat container or (and this is the most important reason) when you reload Solr cores: if you have a custom analyzer JAR file in your plugins directory that uses custom attributes (like lucene-kuromoji.jar, the Japanese Analyzer), you would have a memory leak. Solr loads plugins in its own classloader. If you restart a core it reinitializes its plugins and releases the old classloader. If the AttributeSource would refer to these classes, they could never be unloaded. The same happens if you have a webapp that uses a lucene-core.jar file from outside the webapp (e.g. from the Ubuntu repository in /usr/lib) but has its own analyzers shipped in the webapp. In that case, the classes could not be unloaded on webapp shutdown. The WeakIdentityMap prevents this big resource leak (permgen issue).

If you wonder: the values in the map also have a WeakReference, because the key's weak reference and the Map.Entry are only removed when you actually call get() on the map. If you unload the webapp, nobody calls get() anymore, so all Map.Entrys would keep referring to the classes and would never be removed.

One optimization might be possible: as the number of classes in this map is very low and the important thing is to release the class reference when no longer needed, we could add an option to WeakIdentityMap to make reap() a no-op. This would keep the WeakReferences and Map.Entrys in the map, but the classes could get freed. The small overhead (you can count the number of entries on your fingers) would be minimal, and the lost WeakReferences in the map would be no problem.

Another approach would be to make DefaultAttributeSource have a lookup table (without weak keys) of all Lucene-internal attributes (which are the only ones actually used by IndexWriter). I would prefer this approach.
{quote}

I totally understood the problem with the unloading of the keys (but I think I worded it badly) - I just did not expect it to be grave, since every reload would only leave behind two dead weak references and the related map entry.

A possibly better option than making reap() a no-op could be to only reap on put: one usually invokes get(), but once that event of unloading an interface actually happens and something new needs to be added, one would reap the old keys (in the worst case perhaps one unloading later). I also tried to think of a good way to have one AttributeFactory per class loader (you only really have a problem if the class loader that loads the interface class is a child of the class loader that loaded the AttributeFactory class) but couldn't find one.
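A sketch of how that reap-on-put idea could look inside WeakIdentityMap; the surrounding fields and the exact lookup-key handling are simplified here, so treat it as illustrative rather than the actual implementation:
{code}
// The hot get() path never touches the ReferenceQueue lock that showed up
// in the thread dump; dead entries are swept only on the rare put() that
// follows a class unload (at worst one unloading later, as noted above).
public V put(K key, V value) {
  reap();  // contention is acceptable here: puts happen once per attribute class
  return backingStore.put(new IdentityWeakReference(key, queue), value);
}

public V get(Object key) {
  // no reap() on the indexing hot path
  return backingStore.get(new IdentityWeakReference(key, null));
}
{code}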