[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719221#comment-16719221 ] Christian Ziech commented on LUCENE-8606:
Those two failing tests are now also fixed, but I'm not sure whether I did so in the proper way. Someone with more insight into what the SpanWeight should return from explain() when the simScorer field is null should have a look.
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: (was: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch)
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719183#comment-16719183 ] Christian Ziech commented on LUCENE-8606:
Attached a new patch that fixes all but 2 test failures:
{noformat}
[junit4] Tests with failures [seed: 9F8CCC24EB4194B4]:
[junit4]   - org.apache.lucene.search.TestComplexExplanations.test2
[junit4]   - org.apache.lucene.search.TestComplexExplanationsOfNonMatches.test2
{noformat}
Those two failures are both caused by an NPE in the LeafSimScorer, which in turn is caused by the SpanWeight trying to explain a result with a "null" scorer.

Also, I had to include a somewhat controversial change in the patch, which removes the assertion "assert scoreMode.needsScores()" from the score() method of the AssertingScorer. The problem is that the explain() method of the BooleanQuery invokes the score() function to fill in the value of the Explanation object, and if that BooleanQuery is explained in the context of a ConstantScoreQuery, this assertion fires. I first tried to compute the value of the Explanation from the detail explanations of the BooleanQuery, but that didn't quite add up due to double/float inaccuracies.
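In outline, the pattern that trips the assertion looks roughly like this (a simplified sketch, not the actual BooleanWeight source; subExplanations is an illustrative placeholder):
{code}
// Sketch: explain() obtains a Scorer and calls score() on it to fill in the
// Explanation value. When the Weight was built with a score mode for which
// needsScores() == false (e.g. under a ConstantScoreQuery), AssertingScorer's
// "assert scoreMode.needsScores()" fires on that score() call.
Scorer scorer = scorer(context);
if (scorer != null && scorer.iterator().advance(doc) == doc) {
  float value = scorer.score(); // <-- assertion fires here in the asserting wrapper
  return Explanation.match(value, "sum of:", subExplanations);
}
return Explanation.noMatch("no matching clauses");
{code}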
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: (was: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch)
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719068#comment-16719068 ] Christian Ziech commented on LUCENE-8606:
Getting the tests you mentioned to work is the easy part. The harder part is that explaining a BooleanWeight which was created with "needsScores == false" runs into assertions...
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719035#comment-16719035 ] Christian Ziech commented on LUCENE-8606:
I'm running them right now - somehow I remembered that there was a Jenkins job that would auto-verify patches attached to tickets. But most likely I remembered that wrong.
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719028#comment-16719028 ] Christian Ziech commented on LUCENE-8606:
Sure - that definitely makes the patch smaller. I attached [^0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch], which performs the change only for the ConstantScoreQuery and the CachingWrapperWeight. This adds a small amount of code duplication, though.
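For illustration, a delegating explain() along the lines discussed here might look like the following (a minimal sketch assuming a field holding the wrapped query's Weight; names are illustrative and this is not the attached patch):
{code}
// Sketch: delegate to the wrapped weight and attach its explanation as a
// detail, so the constant score no longer hides which clause matched.
// `innerWeight` and `score` are assumed fields, not taken from the patch.
@Override
public Explanation explain(LeafReaderContext context, int doc) throws IOException {
  final Explanation inner = innerWeight.explain(context, doc);
  if (inner.isMatch()) {
    return Explanation.match(score, getQuery().toString(), inner);
  } else {
    return Explanation.noMatch(getQuery().toString() + " doesn't match id " + doc, inner);
  }
}
{code}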
[jira] [Updated] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-8606:
Attachment: 0001-LUCENE-8606-overwriting-the-explain-method-for-Cachi.patch
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719003#comment-16719003 ] Christian Ziech commented on LUCENE-8606:
[~jpountz] sure, I could also just override the CachingWrapperWeight of the LRUQueryCache and the anonymous inner class of the ConstantScoreQuery ... as you prefer.
[jira] [Comment Edited] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718830#comment-16718830 ] Christian Ziech edited comment on LUCENE-8606 at 12/12/18 11:56 AM:
Sure, I would be happy to supply a change, but I think I'd first like to agree on an approach. The main challenge, in my opinion, is that the ConstantScoreWeight is subclassed in a lot of places in- and outside of Lucene, and changing the constructor signature would be problematic. So I'd add another constructor that passes in the wrapped Weight. The behavior would be improved if the wrapped weight is passed in, and would remain unchanged if that weight is missing. A fix that changes all usages of the ConstantScoreWeight is possible too, but that would break compatibility (as I'd replace the Query parameter with a Weight parameter instead of just adding an alternative constructor) ...
[jira] [Commented] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
[ https://issues.apache.org/jira/browse/LUCENE-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718830#comment-16718830 ] Christian Ziech commented on LUCENE-8606:
Sure, I would be happy to supply a change, but I think I'd first like to agree on an approach. The main challenge, in my opinion, is that the ConstantScoreWeight is subclassed in a lot of places in- and outside of Lucene, and changing the constructor signature would be problematic. So I'd add another constructor that passes in the wrapped Weight. The behavior would be improved if the wrapped weight is passed in, and would remain unchanged if that weight is missing. A fix that changes all usages of the ConstantScoreWeight is possible too, but that would break compatibility (as I'd replace the Query parameter with a Weight parameter instead of just adding an alternative constructor) ...
[jira] [Created] (LUCENE-8606) ConstantScoreQuery loses explain details of wrapped query
Christian Ziech created LUCENE-8606:
Summary: ConstantScoreQuery loses explain details of wrapped query
Key: LUCENE-8606
URL: https://issues.apache.org/jira/browse/LUCENE-8606
Project: Lucene - Core
Issue Type: Improvement
Reporter: Christian Ziech

Right now the ConstantScoreWeight used by the ConstantScoreQuery does not add the details of the wrapped query to the explanation.
{code}
if (exists) {
  return Explanation.match(score, getQuery().toString() + (score == 1f ? "" : "^" + score));
} else {
  return Explanation.noMatch(getQuery().toString() + " doesn't match id " + doc);
}
{code}
This is inconvenient, as it makes it hard to figure out which term actually matched when one puts, e.g., a BooleanQuery into the FILTER clause of another BooleanQuery.
[jira] [Created] (LUCENE-8219) LevenshteinAutomata should estimate the number of states and transitions
Christian Ziech created LUCENE-8219:
Summary: LevenshteinAutomata should estimate the number of states and transitions
Key: LUCENE-8219
URL: https://issues.apache.org/jira/browse/LUCENE-8219
Project: Lucene - Core
Issue Type: Improvement
Reporter: Christian Ziech

Currently the toAutomaton() method of the LevenshteinAutomata class uses the default constructor of Automaton, although it knows exactly how many states the automaton will have and can reasonably estimate how many transitions it will need as well. I suggest changing the lines
{code:language=java|firstline=154|linenumbers=true}
// the number of states is based on the length of the word and n
int numStates = description.size();

Automaton a = new Automaton();
int lastState;
{code}
to
{code:language=java|firstline=154|linenumbers=true}
// the number of states is based on the length of the word and n
final int numStates = description.size();
final int numTransitions = numStates * Math.min(1 + 2 * n, alphabet.length);
final int prefixStates = prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;

final Automaton a = new Automaton(numStates + prefixStates, numTransitions);
int lastState;
{code}
For my test data this cut the total amount of memory needed for int arrays by a factor of 4. The estimate of "1 + 2 * editDistance" should maybe be replaced by a value coming from the ParametricDescription used.
[jira] [Commented] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092849#comment-14092849 ] Christian Ziech commented on LUCENE-5875:
Exactly - when the PagedGrowableWriter in the NodeHash used 1<<27 bytes per page, things worked like a charm (with maxint as suffix length, doShareNonSingletonNodes set to true, and both of the min suffix counts set to 0). What numbers are you interested in? With doShareSuffix enabled the FST takes 3.1 GB of disk space. I quickly fetched the following numbers:
- arcCount: 561802889
- nodeCount: 291569846
- arcWithOutputCount: 201469018

While in theory the nodeCount should hence be lower than 2.1B, I think we also got an exception when enabling packing. But I'm not sure if we tried it in conjunction with doShareSuffix.
[jira] [Commented] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092932#comment-14092932 ] Christian Ziech commented on LUCENE-5875:
{quote}
Hmm, something else is wrong then ... or was this just an OOME? If not, can you reproduce the non-OOME when turning on packing despite node count being well below 2.1B?
{quote}
Sure - give me 1-2 days and I'll paste it here.
[jira] [Created] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
Christian Ziech created LUCENE-5875:
Summary: Default page/block sizes in the FST package can cause OOMs
Key: LUCENE-5875
URL: https://issues.apache.org/jira/browse/LUCENE-5875
Project: Lucene - Core
Issue Type: Improvement
Components: core/FSTs
Affects Versions: 4.9
Reporter: Christian Ziech
Priority: Minor

We are building some fairly big FSTs (the biggest one having about 500M terms with an average of 20 characters per term) and that works very well so far. The problem is just that we can use neither the doShareSuffix nor the doPackFST option of the builder, since both cause exceptions for us: one being an OOM and the other an IllegalArgumentException for a negative array size in ArrayUtil.

The thing here is that in theory we still have far more than enough memory available, but it seems that Java for some reason cannot allocate byte or long arrays of the size the NodeHash needs (maybe fragmentation?). Reducing the page-size constant in the NodeHash from 1<<30 to e.g. 1<<27 seems to mostly fix the issue. Could e.g. the Builder pass through its bytesPageBits to the NodeHash, or could we get a custom parameter for that?

The other problem we ran into was a NegativeArraySizeException when we tried to pack the FST. It seems that we overflowed to 0x80000000. Unfortunately I accidentally overwrote that exception, but I remember it was triggered by the GrowableWriter for the inCounts in line 728 of the FST. If it helps I can try to reproduce it.
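For context, the constant in question is the page size handed to the NodeHash's PagedGrowableWriter; a sketch of the suggested reduction (constructor signature as in the packed-ints API of that era; the concrete values are an assumption based on the numbers above):
{code}
// With a 1 << 30-entry page, a single page's backing long[] can reach
// multiple gigabytes of *contiguous* memory once the node hash grows,
// which a fragmented heap may be unable to satisfy even with free space
// left. 1 << 27-entry pages cap each allocation at roughly 1/8 of that.
PagedGrowableWriter table =
    new PagedGrowableWriter(16, 1 << 27, 8, PackedInts.COMPACT);
{code}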
[jira] [Commented] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089366#comment-14089366 ] Christian Ziech commented on LUCENE-5875:
Oh, there is another OOM we get. At the time the exception was thrown we had been indexing for 5-6 hours and had already closed the IndexWriter; at that point we only wanted to store the special terms we gathered during indexing into a custom FST. When the exception was thrown, effectively only one thread was active in the VM, and the last GC attempt printed the following:

Eden: 0B(4021M)->0B(4021M) Survivors: 75M->75M Heap: 9615M(30720M)->9615M(30720M)

Those values are also pretty much in line with the numbers we get from the runtime if we add custom debug statements.

java.lang.OutOfMemoryError: Java heap space
  at org.apache.lucene.util.packed.Packed64.<init>(Packed64.java:73)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1034)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1001)
  at org.apache.lucene.util.packed.GrowableWriter.<init>(GrowableWriter.java:46)
  at org.apache.lucene.util.packed.GrowableWriter.resize(GrowableWriter.java:98)
  at org.apache.lucene.util.fst.FST.addNode(FST.java:845)
  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:200)
  at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
  at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$FSTWriter.put(AtomicFSTBuilder.java:358)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$WriteTask.run(AtomicFSTBuilder.java:156)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
[jira] [Comment Edited] (LUCENE-5875) Default page/block sizes in the FST package can cause OOMs
[ https://issues.apache.org/jira/browse/LUCENE-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089366#comment-14089366 ] Christian Ziech edited comment on LUCENE-5875 at 8/7/14 3:54 PM:
Oh, there is another OOM we get. At the time the exception was thrown we had been indexing for 5-6 hours and had already closed the IndexWriter; at that point we only wanted to store the special terms we gathered during indexing into a custom FST. When the exception was thrown, effectively only one thread was active in the VM, and the last GC attempt printed the following:

Eden: 0B(4021M)->0B(4021M) Survivors: 75M->75M Heap: 9615M(30720M)->9615M(30720M)

Those values are also pretty much in line with the numbers we get from the runtime if we add custom debug statements.
{code}
java.lang.OutOfMemoryError: Java heap space
  at org.apache.lucene.util.packed.Packed64.<init>(Packed64.java:73)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1034)
  at org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:1001)
  at org.apache.lucene.util.packed.GrowableWriter.<init>(GrowableWriter.java:46)
  at org.apache.lucene.util.packed.GrowableWriter.resize(GrowableWriter.java:98)
  at org.apache.lucene.util.fst.FST.addNode(FST.java:845)
  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:200)
  at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
  at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$FSTWriter.put(AtomicFSTBuilder.java:358)
  at com.nokia.search.candgen.spelling.AtomicFSTBuilder$WriteTask.run(AtomicFSTBuilder.java:156)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
{code}
[jira] [Updated] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-5670:
Attachment: skipOutput_lucene48.patch

Attaching a patch relative to the lucene 4.8 branch.
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998560#comment-13998560 ] Christian Ziech commented on LUCENE-5584:
The additional ctor would be a solution as well, yes. We could then keep the FSTs in some cache and use one per thread.

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
> Key: LUCENE-5584
> URL: https://issues.apache.org/jira/browse/LUCENE-5584
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 4.7.1
> Reporter: Christian Ziech
> Attachments: fst-itersect-benchmark.tgz
>
> The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc, however, is not reused. This can be especially important when traversing large portions of an FST and using the ByteSequenceOutputs and CharSequenceOutputs: those classes create a new byte[] or char[] for every node read (which has an output). In our use case we intersect a Lucene Automaton with an FST<BytesRef>, much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and since the Automaton and the FST are both rather large, tens or even hundreds of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility):
> {code}
> /** Decode an output value previously written with {@link
>  *  #write(Object, DataOutput)} reusing the object passed in if possible */
> public abstract T read(DataInput in, T reuse) throws IOException;
>
> /** Decode an output value previously written with {@link
>  *  #writeFinalOutput(Object, DataOutput)}. By default this
>  *  just calls {@link #read(DataInput)}. This tries to reuse the object
>  *  passed in if possible */
> public T readFinalOutput(DataInput in, T reuse) throws IOException {
>   return read(in, reuse);
> }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method, passing in the output of the reused Arc. For most inputs they could even just invoke the original read(in) method. If you should decide to make that change I'd be happy to supply a patch and/or tests for the feature.
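To make the proposal concrete, a reuse-aware read() for a byte-sequence output could look roughly like this (a sketch assuming the ByteSequenceOutputs wire format of a vInt length followed by the raw bytes; not code from the ticket):
{code}
public BytesRef read(DataInput in, BytesRef reuse) throws IOException {
  final int len = in.readVInt();
  if (reuse == null || reuse.bytes.length < len) {
    reuse = new BytesRef(len); // allocate only when the scratch buffer is too small
  }
  in.readBytes(reuse.bytes, 0, len);
  reuse.offset = 0;
  reuse.length = len;
  return reuse;
}
{code}
Inside the FST, readNextRealArc() would then pass the arc's previous output back in, e.g. arc.output = outputs.read(in, arc.output), so a hot traversal loop allocates only when an output outgrows the scratch buffer.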
[jira] [Comment Edited] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997744#comment-13997744 ] Christian Ziech edited comment on LUCENE-5670 at 5/14/14 4:53 PM:
Oh right! I only checked the 4.7 branch, and there the DataInput didn't have the skipBytes() method yet. But now I saw that both trunk and the 4.8 branch already have skipBytes(long). So yes, of course, in that case we can drop it from the patch. If we can get consensus that the rest of the patch is worth doing, I could implement it against 4.8 and attach it here.

Edit: The ticket that added skipBytes to the DataInput was LUCENE-5583.
[jira] [Commented] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997633#comment-13997633 ] Christian Ziech commented on LUCENE-5670:
No, actually only some of the subclasses of DataInput had a skipBytes() implementation - e.g. the BytesReader intermediate abstract class added it to the interface, and the ByteArrayDataInput had it before as well. Maybe one should scan all the other implementations for a similar method that is just named differently, or for ones that could implement it easily (e.g. IndexInput could implement the skip method as a combination of seek() and getFilePointer()).
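For the IndexInput case mentioned above, the seek-based variant would be a one-liner (an illustrative sketch, not the committed implementation):
{code}
// Sketch: skip forward by seeking relative to the current file pointer.
// IndexInput already exposes getFilePointer() and seek(long), so no bytes
// need to be read and discarded.
@Override
public void skipBytes(long numBytes) throws IOException {
  if (numBytes < 0) {
    throw new IllegalArgumentException("numBytes must be >= 0, got " + numBytes);
  }
  seek(getFilePointer() + numBytes);
}
{code}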
[jira] [Created] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
Christian Ziech created LUCENE-5670:
Summary: org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
Key: LUCENE-5670
URL: https://issues.apache.org/jira/browse/LUCENE-5670
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.7
Reporter: Christian Ziech
Priority: Minor

Currently the FST uses the read(DataInput) method from the Outputs class to skip over outputs it actually is not interested in. For most use cases this just creates additional objects that are immediately destroyed again. When traversing an FST with non-trivial data, however, this can easily add up to many excess objects that nobody ever reads.
[jira] [Commented] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997744#comment-13997744 ] Christian Ziech commented on LUCENE-5670:
Oh right! I only checked the 4.7 branch, and there the DataInput didn't have the skipBytes() method yet. But now I saw that both trunk and the 4.8 branch already have skipBytes(long). So yes, of course, in that case we can drop it from the patch. If we can get consensus that the rest of the patch is worth doing, I could implement it against 4.8 and attach it here.
[jira] [Updated] (LUCENE-5670) org.apache.lucene.util.fst.FST should skip over outputs it is not interested in
[ https://issues.apache.org/jira/browse/LUCENE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Ziech updated LUCENE-5670:
Attachment: LUCENE-5670.patch

Attached an (untested) patch where a skipOutput method is added to the Outputs, which doesn't create excess objects; the default implementation keeps the current behavior by invoking the read() method. Also, a skipBytes(int) method was added to the DataInput, which defaults to reading the data as before. Several implementations of DataInput already had a skipBytes() method and now effectively implement it.
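From that description, the default presumably looks something like the following (names inferred from the comment above, not copied from the patch):
{code}
/** Skip over an output previously written with write(Object, DataOutput).
 *  Sketch of the described default: read and discard the value, preserving
 *  the old behavior; subclasses can override this to advance the input
 *  without materializing an output object at all. */
public void skipOutput(DataInput in) throws IOException {
  read(in); // result intentionally discarded
}
{code}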
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996249#comment-13996249 ] Christian Ziech commented on LUCENE-5584:
{quote}
I'm also confused on why a custom Outputs impl that secretly reuses isn't sufficient here.
{quote}
The problem is that Outputs.add() is just one of the locations that would create a new output-value instance. We do the reusing you suggested for the target of the add operation in our implementation, and it works fine, but it only fixes 50% of the allocation sites. The other half of the sites is in the Outputs.read() method; fixing that site was the reason for this ticket. Basically it would be fine for us if every call to Outputs.read() returned the very same output-value instance, since we only use it to add to the result we build internally (and throw it away thereafter). Since multiple threads use the same FST in parallel, however, this can only be achieved by a hack that uses a ThreadLocal to prevent threading issues with multiple threads using the same output-value instance. This also has the stench of adding a (hidden) state to a supposedly stateless class.
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994934#comment-13994934 ] Christian Ziech commented on LUCENE-5584: - Optimizing our code for intersecting an automaton with an FST (inspired by org.apache.lucene.search.suggest.analyzing.FSTUtil#intersectPrefixPaths) I came across the following locations that create objects that actually could do without: - the scratchArc is created for every node in the automaton - for every state in the Automaton an iterator is created implicitly when iterating over the Transitions of the state - outputs.add() creates a new outputs value object for every state of the automaton if the corresponding FST state had an output - for every transition visited a new IntsRef instance is created - for every FST node read a new outputs value object is created All except the last allocation location was fixed easily: - we keep the scratch arcs in a Stack and hence only create one per level of the automaton (about 10-15 levels for us) - we iterate over the states using an int index in the transitions array - we replaced outputs add by our own method that just appends the outputs of the FST Arc to a single outputs value per intersect call and then upon exiting the recursion just removes it again - same goes for the input IntsRef - we have one instance that is just modified as we traverse the automaton/FST For the last allocation location we now have gone with a special Outputs implementation that uses a rather ugly construct to always return the very same outputs instance for the iterate case per Thread. Thinking about the problem again I came to think of another (easier) solution to that problem. If the outputs of the FST wouldn't actually be a field of the FST itself but if they would be under control of the caller of the FST read*Arc methods just like the BytesReader is, we wouldn't have the problem (maybe instead of the BytesReader). That way we could just create a new Outputs instance for each of our intersection runs and wouldn't need to resort that construct which attaches a state to something that is not meant to have a state. Allow FST read method to also recycle the output value when traversing FST -- Key: LUCENE-5584 URL: https://issues.apache.org/jira/browse/LUCENE-5584 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Affects Versions: 4.7.1 Reporter: Christian Ziech Attachments: fst-itersect-benchmark.tgz The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc however is not reused. This can especially be important when traversing large portions of a FST and using the ByteSequenceOutputs and CharSequenceOutputs. Those classes create a new byte[] or char[] for every node read (which has an output). In our use case we intersect a lucene Automaton with a FSTBytesRef much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and since the Automaton and the FST are both rather large tens or even hundreds of thousands of temporary byte array objects are created. 
[jira] [Updated] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Ziech updated LUCENE-5584:
Attachment: fst-itersect-benchmark.tgz

Sorry it took me quite some time to assemble a benchmark that I can attach here. I basically copied the FST and other required classes into the project and modified the FST so that it supports reusing the value objects of Arcs. As said: the overhead of the temporary Long instances is quite manageable (about 3-5% of the overall execution time in the benchmark) - but there is still some. The main advantage we get from putting the values into the FST is the total space requirement. Right now I don't see an easy way to implement the prefix compression ourselves, outside of the FST - but I need to think about that more carefully.

To run the benchmark yourself, first generate an FST using the RandomFstBuilder class; afterwards you can use the BenchmarkMutableFst class to execute the benchmark.

PS: I also included a variant that uses the intersect implementation of the analyzers package, but I don't think those numbers are a fair comparison, since the analyzers variant takes a shortcut we afaik cannot take with our automaton.
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13967752#comment-13967752 ]

Christian Ziech commented on LUCENE-5584:

{quote}
If this is the case, then I am not sure you are using the correct datastructure: it seems to me that a byte sequence output is not appropriate. Since you do not care about the intermediate outputs, but have a complicated intersection with the FST, why not use a numeric output, pointing to the result data somewhere else?
{quote}

That is what we do right now. This however has the downside that we lose the FST's prefix compression of the values, which is significant in our case: the single FST with the values attached was roughly 1.2 GB, and now, with the referenced byte arrays (we load them into a DirectByteBuffer), we spend about 2.5 GB on the values alone. Of course we could try to implement the same prefix compression as the FST does on our own and fill a byte array while traversing the FST, but that feels like copying something that is already almost there. If we could just get the extension points I mentioned into Lucene, without actually changing the behavior of (most or any of) Lucene's code, that would be a huge help.
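For context, a minimal sketch of this offset-based workaround, assuming Lucene 4.x FST APIs; the method name and the external ByteBuffer value store are illustrative only:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.SortedMap;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// The FST stores only a numeric offset; the real values live in an external
// buffer (e.g. a DirectByteBuffer). This avoids per-arc byte[] allocation on
// reads but gives up the FST's prefix compression of the values.
static FST<Long> buildOffsetFst(SortedMap<BytesRef, byte[]> sortedEntries,
                                ByteBuffer valueStore) throws IOException {
  PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
  Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
  IntsRef scratch = new IntsRef();
  valueStore.put((byte) 0);  // pad: PositiveIntOutputs treats 0 as "no output"
  for (Map.Entry<BytesRef, byte[]> e : sortedEntries.entrySet()) {
    long offset = valueStore.position();       // the only thing the FST keeps
    builder.add(Util.toIntsRef(e.getKey(), scratch), offset);
    valueStore.put(e.getValue());              // value bytes live outside the FST
  }
  return builder.finish();
}
{code}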
[jira] [Comment Edited] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13967752#comment-13967752 ]

Christian Ziech edited comment on LUCENE-5584 at 4/13/14 7:04 AM:

{quote}
If this is the case, then I am not sure you are using the correct datastructure: it seems to me that a byte sequence output is not appropriate. Since you do not care about the intermediate outputs, but have a complicated intersection with the FST, why not use a numeric output, pointing to the result data somewhere else?
{quote}

That is what we do right now. This however has the downside that we lose the FST's prefix compression of the values, which is significant in our case: the single FST with the values attached was roughly 1.2 GB, and now, with the referenced byte arrays (we load them into a DirectByteBuffer), we spend about 2.5 GB on the values alone. Of course we could try to implement the same prefix compression as the FST does on our own and fill a byte array while traversing the FST, but that feels like copying something that is already almost there. If we could just get the extension points I mentioned into Lucene, without actually changing the behavior of (most or any of) Lucene's code, that would be a huge help.

Edit: Also, with numeric outputs we still suffer from quite a few unwanted Long references that are created temporarily by the VM, just as the byte arrays were before. This problem is far less severe and actually manageable, though.
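The allocations the edit refers to come from plain Java autoboxing, since PositiveIntOutputs works on boxed Longs; a toy illustration (not Lucene code):
{code}
// Values outside the JVM's small autobox cache get a fresh Long object on
// every box operation -- one short-lived object per arc with an output.
Long a = 1000000L;
Long b = 2000000L;
Long sum = a + b;  // unbox, add, re-box: allocates a new Long
{code}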
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13966443#comment-13966443 ]

Christian Ziech commented on LUCENE-5584:

Modifying the existing ByteSequenceOutputs and CharSequenceOutputs to actually modify the output passed in upon read turns out to be rather complex, since a lot of Lucene's indexing code actually relies on the immutability of arc outputs. Would it be OK for you guys to have a new mutable ByteSequenceOutputs that could then be used for e.g. the AnalyzingSuggester? The patch would then mostly add the required extension points and only use them in the special cases that would benefit from them.
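As a sketch of what such a mutable variant's read overload could look like under the ticket's proposed read(DataInput, reuse) API (this is the proposal, not committed Lucene code; the NO_OUTPUT subtlety raised later in this thread is sidestepped by still returning the singleton for empty outputs):
{code}
public BytesRef read(DataInput in, BytesRef reuse) throws IOException {
  final int len = in.readVInt();
  if (len == 0) {
    // keep returning the NO_OUTPUT singleton so identity checks still work
    return ByteSequenceOutputs.getSingleton().getNoOutput();
  }
  reuse.grow(len);                 // keep the existing byte[] when large enough
  in.readBytes(reuse.bytes, 0, len);
  reuse.offset = 0;
  reuse.length = len;
  return reuse;                    // no new byte[] per arc
}
{code}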
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13965215#comment-13965215 ]

Christian Ziech commented on LUCENE-5584:

Thanks for the very quick and helpful replies. It seems that I owe you some more hard, concrete information on our use case, what exactly we do, and our environment.

About the environment - the tests were run with
{noformat}
java version 1.7.0_45
OpenJDK Runtime Environment (rhel-2.4.3.4.el6_5-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
{noformat}
on CentOS 6.5. Our VM options don't enable the TLAB right now, but I'm definitely considering using it for other reasons. Currently we are running with the following (GC-relevant) arguments: -Xmx6g -XX:MaxNewSize=700m -XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=35g.

I'm not so much worried about the get performance, although that could be improved as well. We are using Lucene's LevenshteinAutomata class to generate a couple of Levenshtein automata with edit distance 1 or 2 (one for each search term), build the union of them, and intersect that with our FST using a modified version of org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() which uses a callback method to push every matched entry instead of returning the whole list of paths (also for efficiency reasons: we don't actually need the byte arrays but want to parse them into a value object, hence reusing the output byte array is OK for us). A sketch of that setup follows below.

Our FST has about 500M entries, and each entry has a value of approx. 10-20 bytes. For a random query with 4 terms (and hence a union of 4 Levenshtein automata) that produces ~2M visited nodes with output (hence 2M temporary byte[] created) and a total size of ~7.5 MB for the temporary byte arrays (plus the per-instance overhead). In that experiment I matched about 10k terms in the FST. Those numbers already take into account that we use our own add implementation that writes into a single reused BytesRef instance when adding outputs. The overall impact on the GC and also on the execution speed of the method was rather significant in total - I can try to dig up numbers for that, but they would be rather application-specific.

Does this help and answer all the questions so far?

Btw: Experimenting a little with the change I noticed that things may be slightly more complicated, since the output of a node is often overwritten with NO_OUTPUT from the Outputs - so that method would need to recycle the current output as well if possible, which may have interesting side effects - but hopefully that should be manageable.
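To make the query-side setup concrete, roughly how such a union of per-term Levenshtein automata can be assembled with the Lucene 4.x automaton API (the per-term edit-distance policy is an assumption, and the modified intersectPrefixPaths itself is not shown):
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.BasicOperations;
import org.apache.lucene.util.automaton.LevenshteinAutomata;

static Automaton buildFuzzyUnion(List<String> searchTerms) {
  List<Automaton> perTerm = new ArrayList<Automaton>();
  for (String term : searchTerms) {
    int maxEdits = term.length() <= 4 ? 1 : 2;  // hypothetical policy, not from the ticket
    // true = include transpositions as single edits
    perTerm.add(new LevenshteinAutomata(term, true).toAutomaton(maxEdits));
  }
  Automaton union = BasicOperations.union(perTerm);
  BasicOperations.determinize(union);  // the intersection walk expects a DFA
  return union;
}
{code}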
[jira] [Commented] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13965316#comment-13965316 ]

Christian Ziech commented on LUCENE-5584:

Trying to assemble the patch I came across the FST.Arc.copyFrom(Arc) method, which unfortunately seems to implicitly assume that the output of a node is immutable (which it would no longer be). Is this immutability intended? If not, I think the copyFrom() method would need to be moved into the FST class so that it can use the Outputs of the FST to clone the output of the copied arc if it is mutable ... however, that would increase the size of the patch and possibly impact other users too ...
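A hypothetical sketch of that concern: if output values can be mutated in place, a copied Arc must not alias them. Outputs.copy() does not exist in Lucene and stands in for whatever deep-copy hook would have to be added; the method is imagined as living inside FST, where the outputs field is available:
{code}
// Today Arc.copyFrom() simply aliases the output references, which is fine
// for immutable outputs but wrong once they are recycled in place.
private Arc<T> copyArc(Arc<T> to, Arc<T> from) {
  to.copyFrom(from);                               // copies flags, label, target, ...
  to.output = outputs.copy(from.output);           // hypothetical deep copy
  to.nextFinalOutput = outputs.copy(from.nextFinalOutput);
  return to;
}
{code}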
[jira] [Created] (LUCENE-5584) Allow FST read method to also recycle the output value when traversing FST
Christian Ziech created LUCENE-5584:

Summary: Allow FST read method to also recycle the output value when traversing FST
Key: LUCENE-5584
URL: https://issues.apache.org/jira/browse/LUCENE-5584
Project: Lucene - Core
Issue Type: Improvement
Components: core/FSTs
Affects Versions: 4.7.1
Reporter: Christian Ziech

The FST class heavily reuses Arc instances when traversing the FST. The output of an Arc, however, is not reused. This can be especially important when traversing large portions of an FST with the ByteSequenceOutputs and CharSequenceOutputs: those classes create a new byte[] or char[] for every node read (which has an output).

In our use case we intersect a Lucene Automaton with an FST<BytesRef>, much like it is done in org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and since the Automaton and the FST are both rather large, tens or even hundreds of thousands of temporary byte array objects are created.

One possible solution to the problem would be to change the org.apache.lucene.util.fst.Outputs class to have two additional methods (if you don't want to change the existing methods for compatibility):
{code}
/** Decode an output value previously written with
 *  {@link #write(Object, DataOutput)}, reusing the object
 *  passed in if possible. */
public abstract T read(DataInput in, T reuse) throws IOException;

/** Decode an output value previously written with
 *  {@link #writeFinalOutput(Object, DataOutput)}. By default this
 *  just calls {@link #read(DataInput)}. This tries to reuse the
 *  object passed in if possible. */
public T readFinalOutput(DataInput in, T reuse) throws IOException {
  return read(in, reuse);
}
{code}
The new methods could then be used in the FST in the readNextRealArc() method, passing in the output of the reused Arc. For most inputs they could even just invoke the original read(in) method. If you should decide to make that change I'd be happy to supply a patch and/or tests for the feature.
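To make the proposal concrete, roughly where the new overload would plug in; this is a sketch of a possible change to FST.readNextRealArc(), not the shipped code:
{code}
// Shipped code allocates via arc.output = outputs.read(in); the proposal
// recycles the value the reused Arc already carries.
if (arc.flag(BIT_ARC_HAS_OUTPUT)) {
  arc.output = outputs.read(in, arc.output);  // reuse the Arc's previous value
} else {
  arc.output = outputs.getNoOutput();
}
{code}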
[jira] [Commented] (LUCENE-4930) Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention
[ https://issues.apache.org/jira/browse/LUCENE-4930?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13630329#comment-13630329 ]

Christian Ziech commented on LUCENE-4930:

I have checked the Java we are using and it is the latest Java 6 OpenJDK available for Ubuntu 12.04 LTS (6b24-1.11.5-0ubuntu1~12.04.1). There is a newer one available for 12.10 (6b27-1.12.3-0ubuntu1~12.10.1), but the issue is not fixed in that version either ... so there is no way around that code without upgrading to Java 7 ...

The code snippet from the reference queue looks as follows:
{noformat}
/**
 * Polls this queue to see if a reference object is available. If one is
 * available without further delay then it is removed from the queue and
 * returned. Otherwise this method immediately returns <tt>null</tt>.
 *
 * @return A reference object, if one was immediately available,
 *         otherwise <code>null</code>
 */
public Reference<? extends T> poll() {
    synchronized (lock) {
        return reallyPoll();
    }
}
{noformat}
So it seems that Uwe is right about our Java version not doing the double-checking here before actually entering the synchronized block.

However, I'm not really sure I understand the reason for Lucene to actually use a WeakKeyHashMap here: I may be wrong, but wouldn't that reap actually only happen when the interface class itself is unloaded? That should be an extremely rare thing, shouldn't it? If I understand the purpose of that code correctly, it is meant to prevent wasting memory in cases where the user does incremental indexing from time to time; the attribute source would otherwise prevent the interface class and implementation class from being garbage collected in the meantime. But is that case actually really worth the effort (I don't know how big the memory footprint of an Attribute implementation _class_ usually is)? I mean, that would only affect the static fields here (and in plain Lucene I could not find many of those) ...
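For comparison, the double-checked variant in later JDKs (paraphrased from OpenJDK 7, not quoted verbatim) avoids taking the lock at all when the queue is empty:
{code}
public Reference<? extends T> poll() {
    if (head == null)          // cheap unsynchronized fast path
        return null;
    synchronized (lock) {
        return reallyPoll();
    }
}
{code}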
Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention

Key: LUCENE-4930
URL: https://issues.apache.org/jira/browse/LUCENE-4930
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 4.2
Environment: Dell blade system with 16 cores
Reporter: Karl Wright
Attachments: thread_dump.txt

Our project is not optimally using full processing power under indexing load on Lucene 4.2.0. The reason is the AttributeSource.addAttribute() method, which goes through a WeakHashMap synchronizer, which is apparently single-threaded for a significant amount of time. Have a look at the following trace:
{noformat}
pool-1-thread-28 prio=10 tid=0x7f47fc104800 nid=0x672b waiting for monitor entry [0x7f47d19ed000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:98)
	- waiting to lock 0x0005c5cd9988 (a java.lang.ref.ReferenceQueue$Lock)
	at org.apache.lucene.util.WeakIdentityMap.reap(WeakIdentityMap.java:189)
	at org.apache.lucene.util.WeakIdentityMap.get(WeakIdentityMap.java:82)
	at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:74)
	at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:65)
	at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:271)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:107)
	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1148)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1129)
	…
{noformat}
We’ve had to make significant changes to the way we were indexing in order to not hit this issue as much, such as indexing using TokenStreams which we reuse, when it would have been more convenient to index with just tokens. (The reason is that Lucene internally creates TokenStream
[jira] [Commented] (LUCENE-4930) Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention
[ https://issues.apache.org/jira/browse/LUCENE-4930?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=13630542#comment-13630542 ]

Christian Ziech commented on LUCENE-4930:

{quote}
The issue is not class unloading in your own application while it is running. The VM will never do this. It will only unload classes when the ClassLoader is released. This happens e.g. when you redeploy your web application in your Jetty or Tomcat container or (and this is the most important reason) when you reload Solr cores: if you have a custom analyzer JAR file in your plugins directory that uses custom attributes (like lucene-kuromoji.jar, the Japanese Analyzer), you would have a memory leak. Solr loads plugins in its own classloader. If you restart a core it reinitializes its plugins and releases the old classloader. If the AttributeSource would refer to these classes, they could never be unloaded. The same happens if you have a webapp that uses a lucene-core.jar file from outside the webapp (e.g. from the Ubuntu repository in /usr/lib) but has its own analyzers shipped in the webapp. In that case, the classes could not be unloaded on webapp shutdown. The WeakIdentityMap prevents this big resource leak (permgen issue).

If you wonder: the values in the map also have a WeakReference, because the key's weak reference and the Map.Entry are only removed when you actually call get() on the map. If you unload the webapp, nobody calls get() anymore, so all Map.Entrys would keep referring to the classes and would never be removed.

One optimization might be possible: as the number of classes in this map is very low and the important thing is to release the class reference when no longer needed, we could add an option to WeakIdentityMap to make reap() a no-op. This would keep the WeakReferences and Map.Entrys in the map, but the classes could get freed. The small overhead (you can count the number of entries on your fingers) would be minimal, and the lost WeakReferences in the map would be no problem.

Another approach would be to make DefaultAttributeSource have a lookup table (without weak keys) of all Lucene-internal attributes (which are the only ones actually used by IndexWriter). I would prefer this approach.
{quote}

I totally understood the problem with the unloading of the keys (but I think I worded it badly) - I just did not expect it to be grave, since every reload would only leave behind two dead weak references and the related map entry.

A possibly better option than making reap() a no-op could be to only reap on put: one usually invokes get(), but once that event of unloading an interface actually happens and something new needs to be added, one would reap the old keys (in the worst case perhaps one unloading later). I also tried to think of a good way to have one AttributeFactory per class loader (you only really have a problem if the class loader that loads the interface class is a child of the class loader that loaded the AttributeFactory class) but couldn't find one.
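A sketch of how that reap-on-put idea could look inside WeakIdentityMap; the surrounding fields and the exact lookup-key handling are simplified here, so treat it as illustrative rather than the actual implementation:
{code}
// The hot get() path never touches the ReferenceQueue lock that showed up
// in the thread dump; dead entries are swept only on the rare put() that
// follows a class unload (at worst one unloading later, as noted above).
public V put(K key, V value) {
  reap();  // contention is acceptable here: puts happen once per attribute class
  return backingStore.put(new IdentityWeakReference(key, queue), value);
}

public V get(Object key) {
  // no reap() on the indexing hot path
  return backingStore.get(new IdentityWeakReference(key, null));
}
{code}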