On Sun, Sep 12, 2010 at 13:46, Michael McCandless <[email protected]> wrote:
> Having hooks to enable an app to manage its own "external, private stuff associated w/ each segment reader" would be useful and it's been asked for in the past. However, since we've now opened up SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app already do this w/o core API changes?
>
> I know Earwin has built a whole system like this on top of Lucene -- Earwin how did you do that...? Did you make core changes to Lucene...?
I did implement generic plugins for a SR/MSR and friends over 2.9-trunk Lucene, and that's a core change indeed. They didn't handle the IW.getReader case, and I started working on that (along with a major IR.clone/reopen cleanup - LUCENE-2355), but got sidetracked. There's still hope I get back to them in the next couple of months :)

> A custom Codec should be an excellent way to handle the specific use case (caching certain postings) -- by doing it as a Codec, any time anything in Lucene needs to tap into that posting (query scorers, filters, merging, applying deletes, etc.), it hits this cache. You could model it like PulsingCodec, which wraps any other Codec but handles the low-freq ones itself. If you do it externally, how would core use of postings hit it? (Or was that not the intention?)
>
> I don't understand the filter use-case... the CachingWrapperFilter already caches per-segment, so that reopen is efficient? How would an external cache (built on these hooks) be different?
>
> For faster filters we have to apply them like we do deleted docs if the filter is "random access" (such as being cached), LUCENE-1536 -- flex actually makes this relatively easy now, since the postings API no longer implicitly filters deleted docs (i.e. you provide your own skipDocs) -- but these hooks won't fix that, right?
>
> Mike
>
> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer <[email protected]> wrote:
>> Hey Shai,
>>
>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera <[email protected]> wrote:
>>> Hey Simon,
>>>
>>> You're right that the application can develop a Caching mechanism outside Lucene, and when reopen() is called, if it changed, iterate over the sub-readers and init the Cache w/ the new ones.
>>
>> Alright, then we are on the same track I guess!
>>
>>> However, by building something like that inside Lucene, the application will get more native support, and thus better performance, in some cases. For example, consider a field fileType with 10 possible values, and for the sake of simplicity, let's say that the index is divided evenly across them. Your users always add such a term constraint to the query (e.g. they want to get results of fileType:pdf or fileType:odt, and perhaps sometimes both, but not others). You have basically two ways of supporting this:
>>> (1) Add such a term to the query / a clause to a BooleanQuery w/ an AND relation -- the con is that this term / posting is read for every query.
>>
>> Oh, I wasn't saying that a cache framework would be obsolete and shouldn't be part of Lucene. My intention was rather to generalize this functionality so that we can make the API change more easily and at the same time bring the infrastructure you are proposing into place.
>>
>> Regarding your example above, filters are a very good example where something like that could help to improve performance, and we should provide it with Lucene core, but I would again prefer the least intrusive way to do so. If we can make that happen without adding any cache-specific API we should do it. We really should try to sketch out a simple API which gives us access to the opened segReaders and see if that would be sufficient for our use cases. Specialization will always be possible, but I doubt that it is needed.
>>>
>>> (2) Write a Filter which works at the top IR level, that is refreshed whenever the index is refreshed.
>>> This is better than (1); however, it has some disadvantages:
>>>
>>> (2.1) As Mike already proved (on some issue which I don't remember its subject/number at the moment), if we could get Filter down to the lower-level components of Lucene's search, so that e.g. it is used as the deleted-docs OBS, we can get better performance w/ Filters.
>>>
>>> (2.2) The Filter is refreshed for the entire IR, and not just the changed segments. The reason is that, outside Collector, you have no way of telling IndexSearcher "use Filter F1 for segment S1 and F2 for segment S2". Loading/refreshing the Filter may be expensive, and definitely won't perform well w/ NRT, where by definition you'd like to get small changes searchable very fast.
>>
>> No doubt you are right about the above. A PerSegmentCachingFilterWrapper would be something we can easily do on an application-level basis with the infrastructure we are talking about in place. While I don't know exactly how I feel about it, I think this particular problem should rather be addressed internally, and I'm not sure the high-level Cache mechanism is the right way to do it -- but this is just a gut feeling. When I think about it twice, it might well be sufficient to do it that way....
>>>
>>> Therefore I think that if we could provide the necessary hooks in Lucene, let's call it a Cache plug-in for now, we can incrementally improve the search process. I don't want to go too far into the design of a generic plug-ins mechanism, but you're right (again :)) -- we could offer a reopen(PluginProvider) which is entirely not about Cache, and Cache would become one of the Plugins the PluginProvider provides. I just try to learn from past experience -- when the discussion is focused, there's a better chance of getting to a resolution. However, if you think that in this case a more generic API, such as PluginProvider, would get us to a resolution faster, I don't mind spending some time thinking about it. But for all practical purposes, we should IMO start w/ a Cache plug-in, that is called like that, and if it catches on, generify it ...
>>
>> I absolutely agree the API might be more generic, but our current use case / PoC should be caching. I don't like the name Plugin, but that's a personal thing, since you are not plugging anything in. Something like SubreaderCallback or ReaderVisitor might be more accurate, but let's argue about the details later. Why not sketch something out for the filter problem and follow on from there? The more iterations the better. And back to your question whether this would be something committable: if it works standalone / is not too tightly coupled, I would absolutely say yes.
>>>
>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x) so I can't comment on how feasible that solution is. I'll take your word for it that it's doable :). But this doesn't give us a 3x solution ... the Caching framework on trunk can be developed w/ Codecs.
>>
>> I guess nobody really has, except Mike and maybe one or two others, but from what I have done so far regarding codecs I would say that is the place to solve this particular problem. Maybe even lower than that, on a Directory level. Anyhow, let's focus on application-level caches for now.
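Coming back to the fileType example and the per-segment Filter points above, here is a minimal sketch of the kind of application-level, per-segment filter cache being discussed. The class name CachedTermFilter is invented for illustration; it assumes the 3.x-era APIs where IndexSearcher calls Filter.getDocIdSet(IndexReader) once per segment reader and where IndexReader exposes getCoreCacheKey():

    import java.io.IOException;
    import java.util.Map;
    import java.util.WeakHashMap;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    // Caches the documents matching one term (e.g. fileType:pdf) per segment
    // core, so that after a reopen() only new/changed segments are loaded.
    public class CachedTermFilter extends Filter {
      private final Term term;
      private final Map<Object, DocIdSet> cache = new WeakHashMap<Object, DocIdSet>();

      public CachedTermFilter(Term term) {
        this.term = term;
      }

      @Override
      public DocIdSet getDocIdSet(IndexReader segmentReader) throws IOException {
        final Object key = segmentReader.getCoreCacheKey(); // stable across reopens
        synchronized (cache) {
          DocIdSet cached = cache.get(key);
          if (cached != null) {
            return cached;
          }
        }
        final OpenBitSet bits = new OpenBitSet(segmentReader.maxDoc());
        final TermDocs td = segmentReader.termDocs(term);
        try {
          while (td.next()) {
            bits.set(td.doc());
          }
        } finally {
          td.close();
        }
        synchronized (cache) {
          cache.put(key, bits); // OpenBitSet is itself a DocIdSet
        }
        return bits;
      }
    }

Functionally this is close to wrapping a QueryWrapperFilter in a CachingWrapperFilter; the only point of the sketch is that the cache is keyed per segment core, so an NRT reopen pays the loading cost for the new segments alone.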
>> We are not aiming to provide a whole full-fledged Cache API, but the infrastructure to make it easier to build those on an app basis, which would be a valuable improvement. We should also look at Solr's cache implementations and how they could benefit from this effort; since Solr uses app-level caching, we can learn from it API-design-wise.
>>
>> simon
>>>
>>> Shai
>>>
>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer <[email protected]> wrote:
>>>>
>>>> Hi Shai,
>>>>
>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <[email protected]> wrote:
>>>> > Hi
>>>> >
>>>> > Lucene's Caches have been heavily discussed before (e.g., LUCENE-831, LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been many proposals to attack this problem, w/ no developed solution.
>>>>
>>>> I didn't go through those issues, so forgive me if something I bring up has already been discussed. I have a couple of questions about your proposal - please find them inline...
>>>>
>>>> > I'd like to explore a different, IMO much simpler, angle to attack this problem. Instead of having Lucene manage the Cache itself, we let the application manage it; however, Lucene will provide the necessary hooks in IndexReader to allow it. The hooks I have in mind are:
>>>> >
>>>> > (1) IndexReader's current API for TermDocs, TermEnum, TermPositions etc. -- already exists.
>>>> >
>>>> > (2) When reopen() is called, Lucene will take care to call Cache.load(IndexReader), so that the application can pull whatever information it needs from the passed-in IndexReader.
>>>>
>>>> Would that do anything other than pass the new reader (if reopened) to the cache's load method? I wonder if this is more than
>>>>
>>>>   if (newReader != oldReader)
>>>>     Cache.load(newReader)
>>>>
>>>> If so, something like that should be done on a segment reader anyway, right? From my perspective this isn't more than a callback or visitor that should walk through the subreaders and be called for each reopened sub-reader. A cache-warming visitor / callback would then be trivial, and the API would be more general.
>>>>
>>>> > So to be more concrete on my proposal, I'd like to support caching in the following way (and while I've spent some time thinking about it, I'm sure there are great suggestions to improve it):
>>>> >
>>>> > * Application provides a CacheFactory to IndexReader.open/reopen, which exposes some very simple API, such as createCache or initCache(IndexReader) etc. Something which returns a Cache object, which does not have a very strict/concrete API.
>>>>
>>>> My first question would be: why should the reader know about Cache if there is no strict / concrete API? I can follow you on the CacheFactory to create cache objects, but why would the reader have to know / "receive" this object? Maybe this is answered further down the path, but I don't see why the notion of a "cache" must exist within open/reopen, or whether it could be implemented in a more general, "cache"-agnostic way.
>>>>
>>>> > * IndexReader, most probably at the SegmentReader level, uses CacheFactory to create a new Cache instance and calls its load(IndexReader) method, so that the Cache can initialize itself.
>>>>
>>>> That is what I was thinking above -- yet is that more than a callback for each reopened or newly opened segment reader?
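For what it's worth, the callback/visitor Simon describes could be as small as the following sketch. The names (ReopenVisitor, SubReaderCallback, onNewSubReader) are made up here; the sketch only assumes IndexReader.getSequentialSubReaders() and the fact that reopen() shares unchanged sub-readers between the old and the new top-level reader:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.index.IndexReader;

    // A sketch of the "callback for each (re)opened segment reader" idea.
    public final class ReopenVisitor {

      // Hypothetical hook: invoked once for every sub-reader of the new
      // top-level reader that the previous top-level reader did not contain.
      public interface SubReaderCallback {
        void onNewSubReader(IndexReader segmentReader) throws IOException;
      }

      private ReopenVisitor() {}

      public static void visitNew(IndexReader oldTop, IndexReader newTop,
                                  SubReaderCallback callback) throws IOException {
        final Set<IndexReader> oldSubs = new HashSet<IndexReader>();
        if (oldTop != null) {
          addSubs(oldTop, oldSubs);
        }
        final Set<IndexReader> newSubs = new HashSet<IndexReader>();
        addSubs(newTop, newSubs);
        for (IndexReader sub : newSubs) {
          if (!oldSubs.contains(sub)) {    // the newReader != oldReader check
            callback.onNewSubReader(sub);  // e.g. cache.load(sub)
          }
        }
      }

      private static void addSubs(IndexReader top, Set<IndexReader> into) {
        IndexReader[] subs = top.getSequentialSubReaders();
        if (subs == null) {
          into.add(top);                   // atomic reader, e.g. a SegmentReader
        } else {
          for (IndexReader sub : subs) {
            into.add(sub);
          }
        }
      }
    }

An application-level cache would implement onNewSubReader() to warm itself and call ReopenVisitor.visitNew(oldReader, reopenedReader, cache) after every reopen; the open question in this thread is essentially whether Lucene core should make that call for you.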
>>>> > * The application can use CacheFactory to obtain the Cache object per IndexReader (for example, during Collector.setNextReader), or we can have IndexReader offer a getCache() method.
>>>>
>>>> :) Up to here the cache is only used by the application itself, not by any Lucene API, right? I can think of a lot of application-specific data that would be useful to associate with an IR beyond the caching use case -- again, this could be a more general API solving that problem.
>>>>
>>>> > * One of Cache's API methods would be getCache(TYPE), where TYPE is a String or Object, or an interface CacheType w/ no methods, just a marker one, and the application is free to impl it however it wants. That's a loose API, I know, but it's completely in the application's hands, which makes Lucene code simpler.
>>>>
>>>> I like the idea, together with the metadata-associating functionality from above -- something like public <T> T IndexReader#get(Type<T> type). Hmm, that looks quite similar to Attributes, doesn't it?! :) However, this could be done in many ways, but again "cache"-agnostic.
>>>>
>>>> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to provide the user w/ an IndexReader-similar API, only more efficient than, say, TermDocs -- something w/ random access to the docs inside, perhaps even an OpenBitSet. Lucene can take advantage of it if, say, we create a CachingSegmentReader which makes use of the cache and checks, every time termDocs() is called, whether the required Term is cached or not etc. I admit I may be thinking too much ahead.
>>>>
>>>> I see what you are trying to do here. I also see how this could be useful, but I guess coming up with a stable API which serves lots of applications would be quite hard. A CachingSegmentReader could be a very simple decorator which would not require touching the IR interface. Something like that could be part of Lucene, but I'm not sure it necessarily belongs in Lucene core.
>>>>
>>>> > That's more or less what I've been thinking. I'm sure there are many details to iron out, but I hope I've managed to pass the general proposal through to you.
>>>>
>>>> Absolutely, this is how it works, isn't it!
>>>>
>>>> > What I'm after first is to allow applications to deal w/ postings caching more natively. For example, if you have a posting w/ payloads you'd like to read into memory, or if you would like a term's TermDocs to be cached (to be used as a Filter) etc. -- instead of writing something that can work at the top IndexReader level, you'd be able to take advantage of Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>>
>>>> I wonder if a custom codec would be the right place to implement caching / memory-resident structures for postings with payloads etc. You could do that on a higher level too, but a codec seems to be the way to go here, right? To utilize per-segment capabilities, a callback for (re)opened segment readers would be sufficient -- or do I miss something?
>>>>
>>>> simon
>>>>
>>>> > I'm sure that after this is in place, we can refactor FieldCache to work w/ that API, perhaps as a Cache-specific implementation. But I leave that for later.
>>>> >
>>>> > I'd appreciate your comments. Before I set out to implement it, I'd like to know if the idea has any chance of making it to a commit :).
>>>> >
>>>> > Shai
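As a rough sketch of the "cache"-agnostic, Attributes-like association mentioned above: everything below (ReaderData, Key) is hypothetical API, not something that exists in Lucene; it is just the typed-marker-key pattern behind the proposed getCache(TYPE) / get(Type<T>) idea:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical, cache-agnostic holder of typed, app-defined data attached
    // to a reader, along the lines of the proposed <T> T get(Type<T>) method.
    public final class ReaderData {

      // Marker key; the type parameter is what makes get() type-safe.
      public static final class Key<T> {}

      private final Map<Key<?>, Object> data = new HashMap<Key<?>, Object>();

      public synchronized <T> void put(Key<T> key, T value) {
        data.put(key, value);
      }

      @SuppressWarnings("unchecked")
      public synchronized <T> T get(Key<T> key) {
        return (T) data.get(key);
      }
    }

An application would hold one such object per segment reader (for example in a WeakHashMap keyed on the reader or its core cache key, or handed out by the proposed getCache() method), so a filter cache, a payload cache and a FieldCache-style entry could all live behind the same hook.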
--
Kirill Zakharenko/Кирилл Захаренко ([email protected])
Phone: +7 (495) 683-567-4
ICQ: 104465785
