[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695726#action_12695726 ]
Michael McCandless commented on LUCENE-1231:
--------------------------------------------

{quote}
Eventually we need more flexibility to utilize the flexible indexing chain anyway. We need to store which codec to use for a field. Then we could also just make a new codec for column-stride fields and maybe then we do not have to introduce a new Field API.
{quote}

By creating a custom indexing chain you could actually write CSF today. But the lack of extensibility of Field needs to be addressed: you need some way to store something arbitrary & opaque in a field such that your indexing chain can pick it up and act on it. And FieldInfos also needs a "store this opaque thing for me" API.

One of the big changes in LUCENE-1458 is to strongly separate different fields on the read APIs. EG there is a separate FieldsEnum from TermsEnum, meaning you first seek to the field you want, then request a TermsEnum from that, which can iterate through the terms only for that field. It's the codec's job to return the right TermsEnum for a given field.

Not to delay 2.9 further, but... I also wonder, if Lucene had NumericField (say), how it would simplify things here. EG, today, if I have a field "weight" that is a float, I'm going to have to set something to tell the CSF (man, the similarity of that to CFS is going to cause problems!) writer to cast-it-and-save-it-as-float-array to disk; I'm going to have to tell the TrieRangeUtil to do the same, etc. It'd be much better if that field stored a float (not a String), and if it defaulted "naturally" to using these two special indexers...

{quote}
DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc.
{quote}

What is the fastest way in Java to slurp in a bunch of bytes as an int[], short[], float[], etc? Seems that we need to answer that first and then work out how to fix our store APIs to enable that.
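
To make the question concrete, here is a minimal sketch of one candidate: viewing the raw bytes through java.nio and bulk-copying them into an int[]. It uses only standard JDK calls; the class name BulkIntRead and the method toIntArray are made up for illustration and are not Lucene APIs. Whether this is actually the fastest route is exactly what would need measuring.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: decodes a byte[] into an int[]
// through an IntBuffer view instead of assembling each int by hand.
class BulkIntRead {

    // Interprets the bytes as big-endian 32-bit ints (ByteBuffer's default
    // order). Trailing bytes that do not fill a whole int are ignored.
    public static int[] toIntArray(byte[] bytes) {
        IntBuffer view = ByteBuffer.wrap(bytes).asIntBuffer();
        int[] values = new int[view.remaining()];
        view.get(values); // single bulk transfer into the array
        return values;
    }

    public static void main(String[] args) {
        // Build 12 bytes holding three ints, then read them back in bulk.
        ByteBuffer buf = ByteBuffer.allocate(3 * Integer.BYTES);
        buf.putInt(7).putInt(-1).putInt(42);
        System.out.println(Arrays.toString(toIntArray(buf.array())));
    }
}
```

Running main prints [7, -1, 42]. The same pattern works for short[] and float[] via asShortBuffer() and asFloatBuffer(); skipping the copy and handing the IntBuffer itself to callers is the other option.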
(Maybe it's IntBuffer wrapping ByteBuffer, instead of an int[]?)

{quote}
The danger here compared to the current payloads API would be that the user might read too few or too many bytes of a CSF, which would result in an undefined and possibly hard to debug behavior.
{quote}

I think it's better to have good performance with added risk of danger, than forced handholding always.

{quote}
The SafeAccessor would count for you the number of read bytes and throw exceptions if you don't consume the number of bytes you should consume.
{quote}

I generally prefer liberal use of asserts to trip bugs like this, instead of explicit strongly divorced code paths / classes / modes etc., containing real if statements at production runtime.

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
>
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If you however want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
>
> A better way to do this is using payloads. By creating a "special" posting
> list that has one posting with payload for each document you can "simulate" a
> column-stride field. The performance is significantly better compared to
> stored fields, however still not optimal.
> The reason is that for each document the freq value, which is in this
> particular case always 1, has to be decoded, and one position value, which is
> always 0, has to be loaded.
>
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for
> column-stride data; once we decide on a final name for this feature we can
> change this):
>
> CSDList --> FixedLengthList | <VariableLengthList, SkipList>
> FixedLengthList --> <Payload>^SegSize
> VariableLengthList --> <DocDelta, PayloadLength?, Payload>
> Payload --> Byte^PayloadLength
> PayloadLength --> VInt
> SkipList --> see frq.file
>
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure.
> This could work like this: When the DocumentsWriter writes a segment it checks
> whether all values of a field have the same length. If yes, it stores them as
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger
> merges two or more segments it checks if all segments have a FixedLengthList
> with the same length for a column-stride field. If not, it writes a
> VariableLengthList to the new segment.
>
> Once this feature is implemented, we should think about making the
> column-stride fields updateable, similar to the norms. This will be a very
> powerful feature that can for example be used for low-latency tagging of
> documents.
>
> Other use cases:
> - replace norms
> - allow storing boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
>   performance (see LUCENE-831)
>
> Things that need to be done here:
> - decide on a name for this feature :) - I think "column-stride fields" was
>   liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these
>   fields are supposed to be updateable.
> - Define data structures.
>
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start
> implementing.
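
The fixed-length versus variable-length choice described in the issue could be sketched roughly as follows. This is only an illustration of the stated rule, not code from any patch: the class CsdFormatChooser, its methods and the VARIABLE sentinel are all hypothetical, and treating an empty segment as variable-length is an assumption the issue does not address. The sketch also assumes every document in a segment has a value for the field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the format-selection rule; not Lucene code.
class CsdFormatChooser {

    // Sentinel meaning "write a VariableLengthList".
    public static final int VARIABLE = -1;

    // Segment flush: returns the shared value length if all values of the
    // field have the same length (FixedLengthList), else VARIABLE.
    public static int segmentValueLength(List<byte[]> values) {
        if (values.isEmpty()) {
            return VARIABLE; // assumption: no values means no fixed length
        }
        int length = values.get(0).length;
        for (byte[] value : values) {
            if (value.length != length) {
                return VARIABLE;
            }
        }
        return length;
    }

    // Segment merge: the merged segment stays fixed-length only if every
    // input segment is fixed-length with the same length.
    public static int mergedValueLength(int... segmentLengths) {
        if (segmentLengths.length == 0) {
            return VARIABLE;
        }
        int length = segmentLengths[0];
        for (int segmentLength : segmentLengths) {
            if (segmentLength == VARIABLE || segmentLength != length) {
                return VARIABLE;
            }
        }
        return length;
    }

    public static void main(String[] args) {
        List<byte[]> floats = Arrays.asList(new byte[4], new byte[4]);
        List<byte[]> mixed = Arrays.asList(new byte[4], new byte[2]);
        System.out.println(segmentValueLength(floats)); // fixed, length 4
        System.out.println(segmentValueLength(mixed));  // VARIABLE
        System.out.println(mergedValueLength(4, 4));    // stays fixed
        System.out.println(mergedValueLength(4, 8));    // VARIABLE
    }
}
```

Returning a length rather than a boolean keeps the merge check to a single comparison per segment, since the merger needs the length itself to decide whether the inputs are compatible.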