[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

Michael McCandless (JIRA) Sat, 06 Dec 2008 03:52:09 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654055#action_12654055
 ]


Michael McCandless commented on LUCENE-831:
-------------------------------------------


[Note: my understanding of this area in general, and this patch in
particular, is still rather spotty... so please correct my
misconceptions in what follows...]

This change is a great improvement, since the cache management would
be per-IndexReader, and more public so that you could see what's
cached, access the cache via the reader, swap in your own cache
management, etc.

But I'm concerned, because this change continues the "materialize
massive array for entire index" approach, which is the major remaining
cost when (re)opening readers.  For example, the
isMergable()/mergeData() methods build up the whole array from the
sub readers.

What would it take to never require materializing the full array for
the index, for Lucene's internal purposes (external users may continue
to do so if they want)?  Ie, leave the array bound to the "leaf"
IndexReader (ie, SegmentReader).  It was briefly touched on here:

  
https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650964#action_12650964

I realize this is a big change, but I think we need to get there
eventually.

EG I can see in this patch that MultiReader & MultiSegmentReader do
expose a CacheData that has get and get2 (why do we have get2?) that
delegate to child readers, which is good, but it's not good that they
return Object (requires casting for every lookup).  Don't we have
per-atomic-type variants?  Couldn't we expose eg an IntData class (and
likewise for all other types) that has an abstract int get(docID)
method that delegates to child readers?  (I'm also generally confused
about why the per-atomic-type switching happens in CacheKey subclasses
and not in CacheData.)
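
To make the IntData idea concrete, here is a rough sketch of what I
have in mind.  None of these classes exist in the patch: IntData,
ArrayIntData, MultiIntData and maxDoc() are names I made up for
illustration, and the sketch uses plain int[] arrays in place of real
readers.  The point is only that a composite can answer int
get(docID) by delegating to the leaf that owns the doc, with no
Object casts and no merged array.

```java
/** Hypothetical primitive-typed accessor; not part of the patch. */
abstract class IntData {
    /** Value for the given docID, with no boxing or casting. */
    abstract int get(int docID);

    /** Number of docs covered by this instance. */
    abstract int maxDoc();
}

/** Leaf variant: wraps the array held by a single segment. */
final class ArrayIntData extends IntData {
    private final int[] values;

    ArrayIntData(int[] values) { this.values = values; }

    @Override int get(int docID) { return values[docID]; }

    @Override int maxDoc() { return values.length; }
}

/** Composite variant: delegates to children, never builds a merged array. */
final class MultiIntData extends IntData {
    private final IntData[] subs;
    // starts[i] is the first global docID of subs[i];
    // starts[subs.length] is the total doc count.
    private final int[] starts;

    MultiIntData(IntData... subs) {
        this.subs = subs;
        this.starts = new int[subs.length + 1];
        for (int i = 0; i < subs.length; i++) {
            starts[i + 1] = starts[i] + subs[i].maxDoc();
        }
    }

    @Override int maxDoc() { return starts[subs.length]; }

    @Override int get(int docID) {
        if (docID < 0 || docID >= maxDoc()) {
            throw new IndexOutOfBoundsException("docID " + docID);
        }
        // Binary search for the last child whose start is <= docID.
        // Taking the last match skips over empty children.
        int lo = 0, hi = subs.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= docID) lo = mid; else hi = mid - 1;
        }
        return subs[lo].get(docID - starts[lo]);
    }
}

class IntDataDemo {
    public static void main(String[] args) {
        IntData multi = new MultiIntData(
            new ArrayIntData(new int[] {10, 11, 12}),
            new ArrayIntData(new int[] {20, 21}));
        System.out.println(multi.get(3)); // first doc of the second child
    }
}
```

The per-lookup binary search is the price of not merging; whether
that is acceptable for a comparator called on every hit is exactly
the open question in the paragraphs that follow.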

Then... and probably the hardest thing to fix here: for all the
comparators we now materialize the full array.  I realize we use the
full array when sorting during a search of an
IndexSearcher(MultiReader(...)), because FieldSortedHitQueue is called
for every doc visited and must be able to quickly make its comparison.

However, stepping back, this is a poor approach.  We should instead be
doing what MultiSearcher does, which is gather top results
per-sub-reader, and then merge-sort the results.  At that point, to do
the merge, we only need actual field values for those docs in the top
N.
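
Roughly, and only as a sketch of the idea: the code below is not
Lucene code, and PerSegmentTopN, Hit, topN and search are names
invented for illustration.  Each "segment" is a plain int[] of sort
values, every doc counts as a hit, and the sort is ascending by value
with ties broken by docID.  For brevity the final step sorts the
concatenated per-segment results instead of doing a true k-way merge;
the outcome is the same.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: top-N per segment, then merge. */
final class PerSegmentTopN {

    /** A hit: global docID plus its sort-field value. */
    static final class Hit {
        final int doc;
        final int value;
        Hit(int doc, int value) { this.doc = doc; this.value = value; }
    }

    /** Ascending by value, ties broken by docID. */
    static final Comparator<Hit> ORDER =
        Comparator.<Hit>comparingInt(h -> h.value).thenComparingInt(h -> h.doc);

    /** Top n hits of one segment, reading only that segment's array. */
    static List<Hit> topN(int[] segmentValues, int docBase, int n) {
        // Reversed order puts the worst kept hit on top, so it is
        // the one evicted when the queue grows past n.
        PriorityQueue<Hit> pq = new PriorityQueue<>(ORDER.reversed());
        for (int i = 0; i < segmentValues.length; i++) {
            // A real collector would compare against the queue top
            // before allocating; this sketch keeps it simple.
            pq.add(new Hit(docBase + i, segmentValues[i]));
            if (pq.size() > n) pq.poll();
        }
        List<Hit> out = new ArrayList<>(pq);
        out.sort(ORDER);
        return out;
    }

    /** Gather per-segment top n, then keep the overall top n. */
    static List<Hit> search(int[][] segments, int n) {
        List<Hit> merged = new ArrayList<>();
        int docBase = 0;
        for (int[] seg : segments) {
            merged.addAll(topN(seg, docBase, n));
            docBase += seg.length;
        }
        // At most n * segments.length candidates reach this point.
        merged.sort(ORDER);
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }

    public static void main(String[] args) {
        int[][] segments = { {5, 1, 9}, {3, 7}, {2} };
        for (Hit h : search(segments, 3)) {
            System.out.println("doc=" + h.doc + " value=" + h.value);
        }
    }
}
```

The comparisons inside topN only ever touch one segment's array, and
the merge step needs field values for at most n docs per segment, so
nothing here requires an index-wide array.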

If we could fix field-sorting like that (and I'm hazy on exactly how
to do so), I think Lucene internally would then never need the full
array?

This change also adds USE_OA_SORT, which is scary to me because Object
overhead per doc can be exceptionally costly.  Why do we need to even
offer that?


> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>             Fix For: 3.0
>
>         Attachments: fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>     a) eliminate global static map keyed on IndexReader (thus
>         eliminating synch block between completely independent IndexReaders)
>     b) allow more customization of cache management (ie: use 
>         expiration/replacement strategies, disk backed caches, etc)
>     c) allow people to define custom cache data logic (ie: custom
>         parsers, complex datatypes, etc... anything tied to a reader)
>     d) allow people to inspect what's in a cache (list of CacheKeys) for
>         an IndexReader so a new IndexReader can be likewise warmed. 
>     e) Lend support for smarter cache management if/when
>         IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
>     the new implementation, so there is no redundant caching as client code
>     migrates to new API.

