[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035223#comment-13035223 ]

Shai Erera commented on LUCENE-3102:


There are two things left to do:

(1) Use a bit set instead of int[] for docIDs. If we do this, it means the 
Collector cannot support out-of-order collection (which is not a big deal 
IMO). It also means that for large indexes we might consume more RAM than with 
int[]: e.g., a bit set over a 100M-doc index always costs ~12.5 MB, while an 
int[] holding 1M collected docs costs only ~4 MB.

(2) Allow this Collector to stand on its own, w/o necessarily wrapping another 
Collector. There are several ways we can achieve that:
* Take a 'null' Collector and check other != null everywhere. Adds an 'if', 
but not a big deal IMO. Also, acceptsDocsOutOfOrder() will have to return a 
fixed false (or true), or we take that as a parameter.
* Take a 'null' Collector and set this.other to a private static instance of a 
NoOpCollector. We'd still be delegating calls to it, but hopefully it won't be 
expensive. Same issue w/ out-of-order.
* Create two specialized variants of CachingCollector.

Personally I'm not much in favor of the last option - too much code 
duplication for not much gain.

The option I like most is the second (introducing a NoOpCollector). We can 
even expose it as a public static member of CachingCollector and let users 
decide whether to use it. For ease of use, we can still allow 'null' to be 
passed to create().
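
To make that concrete, here's a minimal sketch of option 2 against the 3.x 
Collector API (the create() signature and the specialized subclass names are 
assumptions for illustration, not committed code):

{code:java}
// Sketch only -- NO_OP_COLLECTOR as a public static member of
// CachingCollector, and create() accepting null for convenience.
public static final Collector NO_OP_COLLECTOR = new Collector() {
  @Override
  public void setScorer(Scorer scorer) {}
  @Override
  public void collect(int doc) {}
  @Override
  public void setNextReader(IndexReader reader, int docBase) {}
  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // a no-op collector imposes no ordering constraint of its own
  }
};

public static CachingCollector create(Collector other, boolean cacheScores, double maxRAMMB) {
  if (other == null) {
    other = NO_OP_COLLECTOR; // stand-alone caching; delegation becomes a no-op
  }
  // subclass names below are illustrative (the scoring/non-scoring split
  // discussed elsewhere on this issue)
  return cacheScores ? new ScoreCachingCollector(other, maxRAMMB)
                     : new NoScoreCachingCollector(other, maxRAMMB);
}
{code}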

What do you think?

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, 
 LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depends on the Query) and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
 # I think that instead of comparing curScores to null, in order to determine 
 if scores are requested, we should have a specific boolean - for clarity
 # This check if (base + nextLength > maxDocsToCache) (line 168) can be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version if 
 scoring is requested / not (i.e., impl the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can be used for in-order collection only. So we can 
 use that if the wrapped Collector does not support out-of-order
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such Collector which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
 can make CachingCollector *optionally* wrap another Collector and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035283#comment-13035283 ]

Michael McCandless commented on LUCENE-3102:


The committed CHANGES has a typo (reply should be replay).

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035285#comment-13035285 ]

Michael McCandless commented on LUCENE-3102:


Patch to allow no wrapped collector looks good!  I wonder/hope HotSpot can 
realize those method calls are no-ops...

Maybe change TestGrouping to randomly use this ctor?  Ie, randomly, you use the 
caching collector (not wrapped), then call its replay method twice (once 
against the 1st-pass collector, then against the 2nd-pass one), and then 
assert results like normal.  This is also a good verification that replay 
works twice...
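
Roughly like this (the collector choice and the not-wrapped create() signature 
are assumptions, just to show the shape of the check):

{code:java}
// Cache without wrapping, replay the same cache twice, and assert both
// replays saw the same hits.
CachingCollector cc = CachingCollector.create(true /* acceptDocsOutOfOrder */,
    random.nextBoolean() /* cacheScores */, 64.0 /* maxRAMMB */);
searcher.search(query, cc);
assertTrue("cache overflowed; raise the RAM limit for this test", cc.isCached());

TotalHitCountCollector first = new TotalHitCountCollector();
TotalHitCountCollector second = new TotalHitCountCollector();
cc.replay(first);   // against the 1st-pass collector
cc.replay(second);  // against the 2nd-pass collector -- replay must work twice
assertEquals(first.getTotalHits(), second.getTotalHits());
{code}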

On the OBS, it makes me nervous to just always do this; I'd rather have it 
cut over at some point?  Or perhaps make it an expert optional arg to create, 
whether it should be backed by OBS vs int[]?

Or, ideally... we make a bit set impl that does this all under the hood (uses 
int[] when there are few docs, and upgrades to OBS once there are enough to 
justify it...), then we can just use that bit set here.
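
Something like this, as a rough sketch (the class is made up; the cutoff falls 
out of the RAM math: an int[] of n docs costs 32*n bits vs. maxDoc bits for 
the bit set, so upgrade once n exceeds maxDoc/32):

{code:java}
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.OpenBitSet;

// Hypothetical doc-id set that starts as an int[] and upgrades to an
// OpenBitSet once the array would cost more RAM than the bit set.
// In-order collection only: a bit set loses the insertion order.
class UpgradingDocIdSet {
  private final int maxDoc;
  private int[] docs = new int[16];
  private int count;
  private OpenBitSet bits; // non-null once upgraded

  UpgradingDocIdSet(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  void add(int doc) {
    if (bits != null) {
      bits.fastSet(doc);
    } else if (count < docs.length) {
      docs[count++] = doc;
    } else if (count > maxDoc / 32) {
      bits = new OpenBitSet(maxDoc); // the int[] no longer pays off
      for (int i = 0; i < count; i++) {
        bits.fastSet(docs[i]);
      }
      bits.fastSet(doc);
      docs = null;
    } else {
      docs = ArrayUtil.grow(docs, count + 1);
      docs[count++] = doc;
    }
  }
}
{code}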

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035344#comment-13035344 ]

Shai Erera commented on LUCENE-3102:


bq. The committed CHANGES has typo (reply should be replay).

Thanks, will include it in the next commit.

bq. I'd rather have it cut over at some point

This can only be done if no out-of-order collection has happened so far, 
because otherwise cutting over to OBS would put the cached doc IDs and scores 
out of sync.

bq. we make a bit set impl that does this all under the hood (uses int[] when 
there are few docs, and upgrades to OBS once there are enough to justify 
it...)

That's a good idea. I think we should leave the OBS stuff for another issue. 
Let's first see how this performs and optimize only if needed.

I'll take a look at TestGrouping.

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035637#comment-13035637 ]

Michael McCandless commented on LUCENE-3102:


Thanks Shai -- this is awesome progress!

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-17 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034841#comment-13034841 ]

Michael McCandless commented on LUCENE-3102:


Patch looks great!  But, can we rename curupto -> curUpto (and curbase -> 
curBase)?  Ie, so it matches the other camelCase variables we have here...

Thank you!

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-17 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035188#comment-13035188 ]

Shai Erera commented on LUCENE-3102:


Committed revision 1104680 (3x).
Committed revision 1104683 (trunk).

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034091#comment-13034091 ]

Michael McCandless commented on LUCENE-3102:


Patch looks great Shai -- +1 to commit!!

Yes, that behavior of private fields in inner/outer classes is very sneaky -- 
it's good you added a comment explaining it!

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034283#comment-13034283 ]

Shai Erera commented on LUCENE-3102:


Committed revision 1103870 (3x).
Committed revision 1103872 (trunk).

What's committed:
* Move CachingCollector to core
* Fix bugs
* Add TestCachingCollector
* Some refactoring

Moving on to next proposed changes.

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-15 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033784#comment-13033784 ]

Michael McCandless commented on LUCENE-3102:


Great catches Shai -- thanks for the thorough review!

bq. Since the wrapped Collector may support out-of-order collection, the 
document IDs cached may be out-of-order (depends on the Query) and thus 
replay(Collector) will forward document IDs out-of-order to a Collector that 
may not support it.

Ahh... maybe we throw IllegalArgExc if the replay'd collector requires
in-order docs but the first-pass collection wasn't in order?
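
E.g., something like this at the top of replay() (cachedDocsInOrder is an 
assumed field recording how the first pass collected, just for illustration):

{code:java}
public void replay(Collector other) throws IOException {
  if (!cachedDocsInOrder && !other.acceptsDocsOutOfOrder()) {
    throw new IllegalArgumentException(
        "cannot replay: cached docs are out of order, but the target collector requires in-order docs");
  }
  // ... replay cached segments, doc IDs and scores into 'other' ...
}
{code}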

bq. It does not clear cachedScores + cachedSegs upon exceeding RAM limits

Hmm, I think it does clear cachedScores?  (But not cachedSegs).

bq. I think that instead of comparing curScores to null, in order to determine 
if scores are requested, we should have a specific boolean - for clarity

That sounds great!

bq. This check if (base + nextLength > maxDocsToCache) (line 168) can be 
relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to 
try and cache them?

Oh you mean for the last chunk we could alloc right up to the limit?
Good!
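
Ie, roughly this (using the patch's curDocs/cachedDocs bookkeeping; the exact 
details may differ):

{code:java}
// Instead of bailing out when base + nextLength > maxDocsToCache, cache
// whatever part of the next chunk still fits under the limit.
int toCache = Math.min(nextLength, maxDocsToCache - base);
if (toCache > 0) {
  curDocs = new int[toCache];
  cachedDocs.add(curDocs);
} else {
  curDocs = null; // limit fully consumed; stop caching from here on
}
{code}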

bq. The TODO in line 64 (having Collector specify needsScores()) – why do we 
need that if CachingCollector ctor already takes a boolean cacheScores? I 
think it's better defined explicitly than implicitly?

Yes, I think we should keep the explicit boolean (cacheScores), but eg
you could mess up (pass cacheScores = false but then pass a collector
that calls .score()) -- that's why I added the TODO.  Ie, it'd be nice
if we could verify that the collector agrees we don't need scores.

I think there were other places in Lucene where knowing this up front
could help us... can't remember the details.

bq. Let's introduce a factory method for creating a specialized version if 
scoring is requested / not (i.e., impl the TODO in line 189)

+1

bq. I think it's a useful collector, which stands on its own and not specific 
to grouping. Can we move it to core?

+1

{quote}
How about using OpenBitSet instead of int[] for doc IDs?
If the number of hits is big, we'd gain some RAM back, and be able to cache 
more entries
NOTE: OpenBitSet can be used for in-order collection only. So we can use that 
if the wrapped Collector does not support out-of-order
{quote}

Hmm, but if the number of hits is small we spend unneeded RAM/CPU --
then again, maybe that tradeoff is OK?  I'm just worried about indices
w/ lots of docs... we could also upgrade to a bit set partway
through, since it's so clear where the cutoff is.

bq. Do you think we can modify this Collector to not necessarily wrap another 
Collector? We have such Collector which stores (in-memory) all matching doc IDs 
+ scores (if required). Those are later fed into several processes that operate 
on them (e.g. fetch more info from the index etc.). I am thinking, we can make 
CachingCollector optionally wrap another Collector and then someone can reuse 
it by setting RAM limit to unlimited (we should have a constant for that) in 
order to simply collect all matching docs + scores.

I'd actually rather not have the constant -- ie, I don't want to make
it easy to be unlimited?  It seems too dangerous... I'd rather your
code had to spell out 10*1024 so you realize you're saying 10 GB (for
example).

bq. I think a set of dedicated unit tests for this class alone would be good.

+1

Awesome feedback!!  Are you planning to work up a patch for these...?


[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-15 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033787#comment-13033787 ]

Shai Erera commented on LUCENE-3102:


bq. Hmm, I think it does clear cachedScores? (But not cachedSegs).

Sorry, I meant curScores.

bq. +1 (on move to core)

I will start w/ svn mv this to core, so that later patches on this issue will 
be applied easily. Moving to core has nothing to do w/ resolving the other 
issues.

bq. we could also upgrade to a bit set part way through, since it's so clear 
where the cutoff is

You're right, but cutting over to OBS is dangerous, b/c by doing that we:
# Suddenly halt the search while we create and populate the OBS
# Lose the ability to support out-of-order docs (in fact, depending on the mode 
and how the query was executed so far, we might not even be able to do the 
cut-off at all).
So I prefer that we make that decision up front, perhaps through another 
parameter to the factory method.

bq. but eg you could mess up (pass cacheScores = false but then pass a 
collector that calls .score())

Oh I see, so this TODO is about the use of cachedScorer (vs. just delegating 
setScorer to other). I agree.

BTW, this version of cachedScorer is very optimized and clean, but we do have 
ScoreCachingWrappingScorer, which achieves the same goal, only w/ 1-2 more 
'if's. Perhaps we should reuse it? But then again, for the purpose of this 
Collector, cachedScorer is about as optimized as it can be.
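
For reference, the cachedScorer idea boils down to something like this (a 
sketch; the real inner class may differ in details):

{code:java}
// A Scorer stub whose score() returns a field that the caching collector
// updates per hit, so the delegate never triggers a second score computation.
private static final class CachedScorer extends Scorer {
  int doc;
  float score;
  private CachedScorer() { super((Weight) null); }
  @Override public float score() { return score; }
  @Override public int docID() { return doc; }
  @Override public int nextDoc() { throw new UnsupportedOperationException(); }
  @Override public int advance(int target) { throw new UnsupportedOperationException(); }
}
{code}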

bq. ie, I don't want to make it easy to be unlimited? It seems too dangerous... 
I'd rather your code has to spell out 10*1024 so you realize you're saying 10 
GB (for example).

What if you run w/ 16GB Heap? ;)

But ok, I don't mind, we can spell it out clearly in the jdocs.

bq. Are you planning to work up a patch for these...?

I think so. I'll try to squeeze it in my schedule in the next couple of days. 
If I see I don't get to it, I'll update the issue.


[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-15 Thread Shai Erera (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033788#comment-13033788 ]

Shai Erera commented on LUCENE-3102:


bq. I will start w/ svn mv this to core

Or, we can iterate on all the changes here, then do the svn move as part of the 
commit. Both work for me.
