[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033787#comment-13033787 ]
Shai Erera commented on LUCENE-3102:
------------------------------------

bq. Hmm, I think it does clear cachedScores? (But not cachedSegs).

Sorry, I meant curScores.

bq. +1 (on move to core)

I will start w/ an "svn mv" of this to core, so that later patches on this issue apply easily. Moving to core has nothing to do w/ resolving the other issues.

bq. we could also "upgrade" to a bit set part way through, since it's so clear where the cutoff is

You're right, but cutting over to OBS mid-search is dangerous, b/c by doing that we:
# Suddenly halt the search while we create and populate the OBS.
# Lose the ability to support out-of-order docs (in fact, depending on the mode and how the query was executed so far, we might not be able to do the cut-over at all).

So I prefer that we make that decision up front, perhaps through another parameter to the factory method -- see the sketch below.

bq. but eg you could mess up (pass cacheScores = false but then pass a collector that calls .score())

Oh I see, so this TODO is about the use of cachedScorer (vs. just delegating setScorer to other). I agree. BTW, this version of cachedScorer is very optimized and clean, but we do have ScoreCachingWrappingScorer, which achieves the same goal w/ only 1-2 more 'if's. Perhaps we should reuse it? But then again, for the purpose of this Collector, cachedScorer is as optimized as it can be.

bq. ie, I don't want to make it easy to be unlimited? It seems too dangerous... I'd rather your code has to spell out 10*1024 so you realize you're saying 10 GB (for example).

What if you run w/ a 16GB heap? ;) But ok, I don't mind, we can spell it out clearly in the jdocs.

bq. Are you planning to work up a patch for these...?

I think so. I'll try to squeeze it into my schedule in the next couple of days. If I see I don't get to it, I'll update the issue.
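To make the "decide up front" point concrete, here is a rough sketch (not a patch; the subclass names and the extra factory parameter are hypothetical):

{code:java}
// Sketch only: choose the doc-ID storage when the collector is created,
// instead of cutting over to OpenBitSet in the middle of a search.
// BitSetCachingCollector / ArrayCachingCollector are made-up names.
public static CachingCollector create(Collector other, boolean cacheScores,
                                      double maxRAMMB, boolean useBitSet) {
  if (useBitSet && !other.acceptsDocsOutOfOrder()) {
    // In-order collection: an OpenBitSet per segment can record the doc IDs
    // and is much cheaper than int[] when the hit count is large.
    return new BitSetCachingCollector(other, cacheScores, maxRAMMB);
  }
  // Out-of-order docs (or explicit opt-out): keep the int[] buffers so the
  // original collection order can be replayed.
  return new ArrayCachingCollector(other, cacheScores, maxRAMMB);
}
{code}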
> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>
> CachingCollector (introduced in LUCENE-1421) has a few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depending on the Query), and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
> # I think that instead of comparing curScores to null in order to determine if scores are requested, we should have a specific boolean -- for clarity.
> # The check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version depending on whether scoring is requested or not (i.e., impl the TODO in line 189).
> * I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries.
> ** NOTE: OpenBitSet can only be used for in-order collection. So we can use it if the wrapped Collector does not support out-of-order collection.
> * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.
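Regarding the "collect all matching docs + scores and replay them" bullet in the quoted description, usage could look roughly like this (a sketch of the proposed factory; method names and signatures are illustrative, not a committed API, and an IndexSearcher 'searcher' and Query 'query' are assumed):

{code:java}
// First-pass collector; any Collector would do.
TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);

// Proposed factory: wrap it, cache doc IDs + scores, cap RAM at 64 MB.
CachingCollector cc = CachingCollector.create(tdc, true /* cacheScores */, 64.0 /* maxRAMMB */);
searcher.search(query, cc);

if (cc.isCached()) {
  // All hits (and scores) fit in the cache -- feed them to a second
  // collector without executing the query again.
  cc.replay(new TotalHitCountCollector());
}
{code}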