[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796200#action_12796200 ]
Michael McCandless commented on LUCENE-2186:
--------------------------------------------

bq. Is this patch for flex, as it contains CodecUtils and so on?

Actually it's intended for trunk; I was thinking this should land before flex (it's a much smaller change, and it's "isolated" from flex), so I wrote the CodecUtil/BytesRef basic infrastructure thinking flex would then cut over to them.

{quote}
Hmm, so random-access would obviously be the preferred approach for SSDs, but with conventional disks I think the performance would be poor? In 1231 I implemented the var-sized CSF with a skip list, similar to a posting list. I think we should add that here too, and we can still keep the additional index that stores the pointers? We could have two readers: one that allows random access and loads the pointers into RAM (or uses MMAP as you mentioned), and a second one that doesn't load anything into RAM, uses the skip lists and only allows iterator-based access?
{quote}

The intention here is for this ("index values") to replace field cache, but not aim (initially at least) to do much more. Ie, it's "meant" to be RAM resident (either via explicit slurping into RAM or via MMAP), so the SSD or spinning magnets should not be hit on retrieval. If we add an iterator API, I think it should be simpler than the postings API (ie, no seeking; dense iteration where every doc is visited sequentially).

{quote}
It looks like ByteRef is very similar to Payload? Could you use that instead and extend it with the new String constructor and compare methods?
{quote}

Good point! I agree. Also, we should use BytesRef when reading the payload from TermsEnum. Actually I think Payload, BytesRef and TermRef (in flex) should all eventually be merged; of the three names, I like BytesRef the best. With *Enum in flex we can switch to BytesRef. For analysis we should switch PayloadAttribute to BytesRef and deprecate the methods using Payload? Hmmm... but PayloadAttribute is an interface.

{quote}
So it looks like with your approach you want to support certain "primitive" types out of the box, such as byte[], float, int, String?
{quote}

Actually, all "primitive" integer types (byte/short/int/long) are "included" under int, as well as arbitrary bit precisions "between" those primitive types. Because the API uses a method invocation (eg IntSource.get) instead of direct array access, we can "hide" how many bits are actually used under the impl (see the sketch below). The same is true for float/double, except we can't [easily] do arbitrary bit precision there... just 4 or 8 bytes.

{quote}
If someone has custom data types, then they have, similar as with payloads today, the byte[] indirection?
{quote}

Right, byte[] is for String, but also for arbitrary (opaque to Lucene) extensibility. The six concrete impls (separate package-private classes) should give good efficiency to fit the different use cases.
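To make that concrete, here is a minimal sketch of the get()-hides-the-bit-width idea; the class name and packing layout are hypothetical (this is not the patch's PackedInts impl), assuming values are packed back to back into a shared long[]:

{code:java}
// Hypothetical sketch: a fixed-bit-width reader that exposes values only
// through get(), so callers never see how many bits each value occupies.
final class PackedIntSource {
  private final long[] blocks;      // values packed back to back, 64 bits per block
  private final int bitsPerValue;   // eg 5 if every value fits in 5 bits
  private final long mask;

  PackedIntSource(long[] blocks, int bitsPerValue) {
    this.blocks = blocks;
    this.bitsPerValue = bitsPerValue;
    this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
  }

  /** Returns the value for this docID; the bit width is an impl detail. */
  public long get(int docID) {
    final long bitPos = (long) docID * bitsPerValue;
    final int block = (int) (bitPos >>> 6);
    final int offset = (int) (bitPos & 63);
    long value = blocks[block] >>> offset;
    final int spill = offset + bitsPerValue - 64;
    if (spill > 0) {                // value straddles two longs
      value |= blocks[block + 1] << (bitsPerValue - spill);
    }
    return value & mask;
  }
}
{code}

Callers that sort or filter by the value never see whether 4, 5 or 64 bits are used per document; changing the precision only changes the impl behind get().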
{quote}
The code I initially wrote for 1231 exposed IndexOutput, so that one can call write*() directly, without having to convert to byte[] first. I think we will also want to do that for 2125 (store attributes in the index). So I'm wondering if this and 2125 should work similarly?
{quote}

This is compelling (letting Attrs read/write directly), but I have some questions:

* How would the random-access API work? (Attrs are designed for iteration.) Eg, just providing IndexInput/Output to the Attr isn't quite enough -- the encoding is sometimes context dependent (like frq writing the delta between docIDs, or the symbol table needed when reading/writing deref/sorted). How would I build a random-access API on top of that? captureState-per-doc is too costly. And what API would be used to write the shared state, ie to tell the Attr "we are now writing the segment, so you need to dump the symbol table"?
* How would the packed ints work? Eg, say my ints only need 5 bits. (Attrs are sort of designed for one value at a time.)
* How would the "symbol table" based encodings (deref, sorted) work? I guess the attr would need to have some state associated with it, and when I first create the attr I'd need to pass it the segment name, Directory, etc, so it opens the right files?
* I'm thinking we should still directly support native types, ie Attrs are there for extensibility beyond native types?
* Exposing a single attr across a multi-reader sounds tricky -- LUCENE-2154 (and we need this for flex, which is worrying me!). But it sounds like you and Uwe are making some progress on that (using some under-the-hood Java reflection magic)... and this doesn't directly affect this issue, assuming we don't expose this API at the MultiReader level.

{quote}
Thinking out loud: could we then have attributes with serialize/deserialize methods for primitive types, such as float? Could we efficiently use such an approach all the way up to FieldCache? It would be compelling if you could store an attribute as CSF, or in the posting list, retrieve it from the flex APIs, and also from the FieldCache. All would be the same API and there would only be one place that needs to "know" about the encoding (the attribute).
{quote}

This is the grand unification of everything :) I like it, but I don't want that future utopia to stall our progress today... ie I'd rather do something simple yet concrete now, and then work step by step towards that future ("progress not perfection"). That said, if we can get some bite-sized step in today towards that future, that'd be good.

Eg, the current patch only supports "dense" storage, ie it's assumed every document will have a value, because it's aiming to replace field cache. If we wanted to add sparse storage... I think that'd require (or strongly encourage) access via a postings-like iteration API, which I don't see how to take a baby step towards :)

I do think it would be compelling for an Attr to "only" have to expose read/write methods, and then the Attr could be stored in CSF or postings, but I don't see how to build an efficient random-access API on top of that. I think LUCENE-2125 is where we should explore this.

Norms and deleted docs should eventually be able to switch to CSF. In fact, norms should just be a FloatSource, with the default impl being the 1-byte float encoding we use today. This then gives apps full flexibility to plug in their own FloatSource. For deleted docs we should probably create a BoolSource.
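As a rough illustration of that last point (hypothetical classes; the decode table stands in for whatever 1-byte-to-float decoding the default norm encoding uses), norms-as-FloatSource could look something like:

{code:java}
// Hypothetical sketch of "norms are just a FloatSource": the default impl
// keeps today's 1-byte-per-doc encoding and decodes through a 256-entry
// table; apps could plug in their own FloatSource with full precision.
abstract class FloatSource {
  public abstract float get(int docID);
}

final class ByteNormFloatSource extends FloatSource {
  private final byte[] norms;        // one byte per document, as today
  private final float[] decodeTable; // 256 entries, built from the 1-byte encoding

  ByteNormFloatSource(byte[] norms, float[] decodeTable) {
    this.norms = norms;
    this.decodeTable = decodeTable;
  }

  @Override
  public float get(int docID) {
    return decodeTable[norms[docID] & 0xFF];
  }
}
{code}

A BoolSource for deleted docs would follow the same pattern over a packed bit set.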
{quote}
About updating CSF: I hope we can use parallel indexing for that. In other words: it should be possible for users to use parallel indexes to update certain fields, and Lucene should use the same approach internally to store different "generations" of things like norms and CSFs.
{quote}

That sounds great, though I think we need a more efficient way to store the changes. Ie, norms rewrites all norms on any change, which is costly. It'd be better to have some sort of delta format, where you sparsely encode docID + new value, and then on load we merge those on the fly (and segment merging periodically also merges & commits them); a rough sketch of such an overlay is below.

{quote}
Yeah, that's where I got kind of stuck with 1231: we need to figure out how the public API should look, with which a user can add CSF values to the index and retrieve them. The easiest and fastest way would be to add a dedicated new API. The cleaner one would be to make the whole Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.
{quote}

Right, but LUCENE-1597 is another good but far-away-from-landing goal. I think a dedicated API is fine for the atomic types; field cache today is a dedicated API...

I guess to sum up my thoughts now (but I'm still mulling...):

* I think the random-access, field-cache-like API should be separate from the designed-for-iteration-from-a-file postings API.
* Attrs for extensibility could be compelling, but I don't see how to build an [efficient] random-access API on top of Attrs. It would be very elegant to only have to add a read/write method to your Attr, but that's not really enough for a full codec.
* I don't think we should hold up adding direct support for atomic types until (if ever) we figure out how to add Attrs. Ie I think we should do this in two steps.

The current patch is [roughly] step 1, and I think it should be a compelling replacement for field cache. Memory usage and GC cost of string sorting should be much lower than field cache.

I'm also still mulling on these issues w/ the current patch:

* How could we use index values to efficiently maintain the stats needed for flexible scoring (LUCENE-2187)?
* The current patch doesn't handle merging yet.
* Could norms/deleted docs "conceivably" cut over to the index values API?
* What "dedicated API" to use for indexing & sorting?
* Run basic perf tests to see the cost of using a method call instead of direct array access.
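Tying back to the delta-format idea above, here is a minimal sketch (hypothetical classes, reusing the PackedIntSource sketch from earlier) of overlaying a sparse generation of (docID, new value) pairs over a dense base at read time:

{code:java}
import java.util.Arrays;

// Hypothetical sketch: a dense base of per-doc values overlaid with a sparse
// "delta generation" of updated values, sorted by docID. Readers see the
// merged view; a segment merge would later fold the deltas back into a new
// dense base instead of rewriting every value on each change.
final class OverlayIntSource {
  private final PackedIntSource base; // dense per-doc values
  private final int[] changedDocs;    // sorted docIDs that were updated
  private final long[] newValues;     // parallel array of replacement values

  OverlayIntSource(PackedIntSource base, int[] changedDocs, long[] newValues) {
    this.base = base;
    this.changedDocs = changedDocs;
    this.newValues = newValues;
  }

  public long get(int docID) {
    final int slot = Arrays.binarySearch(changedDocs, docID);
    return slot >= 0 ? newValues[slot] : base.get(docID);
  }
}
{code}

Each lookup pays a binary search over only the changed docs, which stays cheap as long as segment merging periodically commits the deltas into a fresh dense base.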
> First cut at column-stride fields (index values storage)
> ---------------------------------------------------------
>
>                 Key: LUCENE-2186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2186
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>         Attachments: LUCENE-2186.patch
>
>
> I created an initial basic impl for storing "index values" (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc.
> The code is standalone now, and lives under the new package oal.index.values (plus some util changes and refactorings) -- I have yet to integrate it into Lucene, so that eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc.
> It handles 3 types of values:
> * Six variants of byte[] per doc: all combinations of fixed vs variable length, stored either "straight" (good for eg a "title" field), "deref" (good when many docs share the same value, but you won't do any sorting) or "sorted".
> * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between)
> * Floats (4 or 8 byte precision)
> String fields are stored as their UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them).
> This patch also adds a basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation per document.
> Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of this index values API should be much better than field cache, since it does not create an object per document (instead it shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes.
> There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (returning flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring.
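As an illustration of the "sorted" byte[] variant and the shared-arrays point in the description above (class and field names are made up, not the patch's): all distinct values live in one shared byte[] in sorted order, each document stores only the ord of its value, and sort comparisons reduce to int comparisons with no per-document String objects:

{code:java}
// Hypothetical sketch of the "sorted" byte[] storage: one shared byte pool
// holds the distinct values concatenated in sorted order, and each document
// stores just the ord of its value. Sorting docs by value compares ords;
// the actual bytes are only touched when the value itself is needed.
final class SortedBytesSource {
  private final byte[] pool;      // all distinct values, concatenated in sorted order
  private final int[] offsets;    // offsets[ord] .. offsets[ord + 1] bounds value #ord
  private final int[] docToOrd;   // per-document ord (could itself be packed ints)

  SortedBytesSource(byte[] pool, int[] offsets, int[] docToOrd) {
    this.pool = pool;
    this.offsets = offsets;
    this.docToOrd = docToOrd;
  }

  /** Ord is enough for sorting: doc A sorts before doc B iff ord(A) < ord(B). */
  public int ord(int docID) {
    return docToOrd[docID];
  }

  /** Materializes the UTF-8 bytes of a doc's value only when actually needed. */
  public byte[] bytes(int docID) {
    final int ord = docToOrd[docID];
    final int start = offsets[ord];
    final int length = offsets[ord + 1] - start;
    final byte[] copy = new byte[length];
    System.arraycopy(pool, start, copy, 0, length);
    return copy;
  }
}
{code}

Since docToOrd can be stored with packed ints, the per-document cost is just the bits needed to address the distinct values, which is where the RAM and GC savings over field cache's per-doc String objects come from.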