[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654055#action_12654055 ]
Michael McCandless commented on LUCENE-831:
-------------------------------------------

[Note: my understanding of this area in general, and this patch in particular, is still rather spotty... so please correct my misconceptions in what follows...]

This change is a great improvement, since the cache management would be per-IndexReader, and more public so that you could see what's cached, access the cache via the reader, swap in your own cache management, etc.

But I'm concerned, because this change continues the "materialize massive array for entire index" approach, which is the major remaining cost when (re)opening readers. EG, the isMergable()/mergeData() methods build up the whole array from sub readers.

What would it take to never require materializing the full array for the index, for Lucene's internal purposes (external users may continue to do so if they want)? Ie, leave the array bound to the "leaf" IndexReader (ie, SegmentReader). It was briefly touched on here:

    https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650964#action_12650964

I realize this is a big change, but I think we need to get there eventually.

EG I can see in this patch that MultiReader & MultiSegmentReader do expose a CacheData that has get and get2 (why do we have get2?) that delegate to child readers, which is good, but it's not good that they return Object (which requires casting for every lookup). We don't have per-atomic-type variants? Couldn't we expose eg an IntData class (and likewise for all other types) that has an abstract int get(docID) method, which delegates to child readers? (I'm also generally confused by why we have the per-atomic-type switching happening in CacheKey subclasses and not CacheData.)

Then... and probably the hardest thing to fix here: for all the comparators we now materialize the full array.
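To make the IntData suggestion above concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: IntData is the name proposed in the comment, but SegmentIntData, MultiIntData and their constructors are hypothetical and do not come from the patch or from Lucene. The point is that the composite reader holds no array of its own; it maps a global docID to a child and delegates, returning a primitive int with no cast.

```java
import java.util.Arrays;

// Hypothetical sketch: these classes are NOT part of Lucene or the LUCENE-831 patch.
abstract class IntData {
    /** Returns the cached int value for the given document ID, with no Object cast. */
    abstract int get(int docID);
}

/** Leaf-level data: the array stays bound to a single segment. */
final class SegmentIntData extends IntData {
    private final int[] values;

    SegmentIntData(int[] values) {
        this.values = values;
    }

    @Override
    int get(int docID) {
        return values[docID];
    }
}

/** Composite view over child readers: no merged array is ever built. */
final class MultiIntData extends IntData {
    private final IntData[] children;
    private final int[] starts; // starts[i] = first global docID owned by child i

    MultiIntData(IntData[] children, int[] maxDocs) {
        this.children = children;
        this.starts = new int[children.length];
        int base = 0;
        for (int i = 0; i < children.length; i++) {
            starts[i] = base;
            base += maxDocs[i];
        }
    }

    @Override
    int get(int docID) {
        // Find the last child whose start is <= docID.
        int idx = Arrays.binarySearch(starts, docID);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one
        }
        // Skip over empty children that share the same start offset.
        while (idx + 1 < starts.length && starts[idx + 1] <= docID) {
            idx++;
        }
        return children[idx].get(docID - starts[idx]);
    }
}
```

With two segments holding {10, 20, 30} and {40, 50}, a MultiIntData built over them resolves global docID 3 to local docID 0 of the second segment. The cost is a binary search per lookup, which is the trade-off against materializing one large array.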
I realize we use the full array when sorting during a search of an IndexSearcher(MultiReader(...)), because FieldSortedHitQueue is called for every doc visited and must be able to quickly make its comparison. However, stepping back, this is a poor approach. We should instead be doing what MultiSearcher does, which is to gather top results per sub-reader, and then merge-sort the results. At that point, to do the merge, we only need actual field values for those docs in the top N. If we could fix field-sorting like that (and I'm hazy on exactly how to do so), I think Lucene internally would then never need the full array?

This change also adds USE_OA_SORT, which is scary to me because Object overhead per doc can be exceptionally costly. Why do we need to even offer that?

> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>             Fix For: 3.0
>
>         Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>     a) eliminate global static map keyed on IndexReader (thus eliminating synch block between completely independent IndexReaders)
>     b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
>     c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
>     d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed.
>     e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to new API.
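
The per-sub-reader sorting idea raised in the comment (gather the top results from each sub-reader, then merge, touching field values only for docs in the top N) could be sketched roughly as follows. This is a simplified illustration, not how Lucene or the patch does it: TopNMergeSketch, Hit, topNForSegment and search are hypothetical names, each segment is modelled as a plain int[] of sort values, every doc is treated as a match, and the final step sorts the small candidate list rather than performing a true k-way merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names exist in Lucene or the LUCENE-831 patch.
final class TopNMergeSketch {

    /** A candidate hit: global doc ID plus the sort value copied out of its segment. */
    static final class Hit {
        final int doc;
        final int value;

        Hit(int doc, int value) {
            this.doc = doc;
            this.value = value;
        }
    }

    /** Ascending by value, ties broken by doc ID. */
    static final Comparator<Hit> ORDER =
        Comparator.comparingInt((Hit h) -> h.value).thenComparingInt(h -> h.doc);

    /** Collects the n best docs of one segment using only that segment's own array. */
    static List<Hit> topNForSegment(int[] values, int docBase, int n) {
        // The queue head is the worst hit kept so far, so it is the one evicted.
        PriorityQueue<Hit> worstFirst = new PriorityQueue<>(ORDER.reversed());
        for (int doc = 0; doc < values.length; doc++) {
            worstFirst.add(new Hit(docBase + doc, values[doc]));
            if (worstFirst.size() > n) {
                worstFirst.poll();
            }
        }
        List<Hit> hits = new ArrayList<>(worstFirst);
        hits.sort(ORDER);
        return hits;
    }

    /** Combines per-segment results; at most n * segments.length values are compared here. */
    static List<Hit> search(int[][] segments, int n) {
        List<Hit> candidates = new ArrayList<>();
        int docBase = 0;
        for (int[] segment : segments) {
            candidates.addAll(topNForSegment(segment, docBase, n));
            docBase += segment.length;
        }
        candidates.sort(ORDER);
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

No index-wide array is built at any point: each segment is scanned against its own values, and the combining step sees only the few candidates each segment produced.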