[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582464#action_12582464 ]
Michael McCandless commented on LUCENE-1231:
--------------------------------------------

Sorry, you're right: the payload is the binary data.

{quote}
So there are a number of features these fields would have that differ from other fields:
{quote}

Maybe add "stored in its own file" or some such to that list. I.e., to efficiently update field X, I would think you want it stored in its own file. We would then fully write a new generation of that file whenever it had changes.

I agree it would be great to implement this as "flexible indexing", such that these are simply a la carte options on how the field is indexed, rather than making a new specialized kind of field that just does one of these "combinations". But I haven't wrapped my brain around what all this will entail... it's a biggie!

{quote}
BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.
{quote}

I like that!

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
> Key: LUCENE-1231
> URL: https://issues.apache.org/jira/browse/LUCENE-1231
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.4
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If, however, you want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
> A better way to do this is using payloads.
> By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a
> column-stride field. The performance is significantly better compared to
> stored fields, but still not optimal. The reason is that for each document
> the freq value, which in this particular case is always 1, has to be decoded,
> and one position value, which is always 0, has to be loaded as well.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for
> column-stride data; once we decide on a final name for this feature we can
> change this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList>
> FixedLengthList --> <Payload>^SegSize
> VariableLengthList --> <DocDelta, PayloadLength?, Payload>
> Payload --> Byte^PayloadLength
> PayloadLength --> VInt
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data
> structure.
> This could work like this: When the DocumentsWriter writes a segment it
> checks whether all values of a field have the same length. If yes, it stores
> them as FixedLengthList; if not, as VariableLengthList. When the
> SegmentMerger merges two or more segments it checks whether all segments have
> a FixedLengthList with the same length for a column-stride field. If not, it
> writes a VariableLengthList to the new segment.
> Once this feature is implemented, we should think about making the
> column-stride fields updateable, similar to the norms. This will be a very
> powerful feature that can, for example, be used for low-latency tagging of
> documents.
> Other use cases:
> - replace norms
> - allow storing boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide on a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these
> fields are supposed to be updateable.
> - Define data structures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start
> implementing.
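
The format-selection rule in the quoted description (FixedLengthList when all values have the same length, VariableLengthList otherwise, at flush time and again at merge time) could be sketched roughly as below. This is only an illustration under stated assumptions: CsdFormatChooser, Format, SegmentColumn, chooseOnFlush and chooseOnMerge are hypothetical names, not Lucene APIs, and the sketch assumes every document in the segment has exactly one value for the field.

```java
import java.util.List;

/**
 * Illustrative sketch only. All names here are hypothetical and are not
 * part of Lucene; they mirror the rule described in the issue text.
 */
final class CsdFormatChooser {

    enum Format { FIXED_LENGTH, VARIABLE_LENGTH }

    /** What a merger would need to know about one segment's column (hypothetical). */
    static final class SegmentColumn {
        final Format format;
        final int fixedLength; // only meaningful when format == FIXED_LENGTH

        SegmentColumn(Format format, int fixedLength) {
            this.format = format;
            this.fixedLength = fixedLength;
        }
    }

    /**
     * Flush-time rule: fixed-length only if every value has the same byte length.
     * Assumption: one value per document; an empty segment counts as fixed-length.
     */
    static Format chooseOnFlush(List<byte[]> valuesPerDoc) {
        if (valuesPerDoc.isEmpty()) {
            return Format.FIXED_LENGTH;
        }
        int expected = valuesPerDoc.get(0).length;
        for (byte[] value : valuesPerDoc) {
            if (value.length != expected) {
                return Format.VARIABLE_LENGTH;
            }
        }
        return Format.FIXED_LENGTH;
    }

    /**
     * Merge-time rule: the merged segment stays fixed-length only if every
     * input segment is fixed-length and all of them use the same value length.
     */
    static Format chooseOnMerge(List<SegmentColumn> segments) {
        boolean first = true;
        int expected = 0;
        for (SegmentColumn segment : segments) {
            if (segment.format != Format.FIXED_LENGTH) {
                return Format.VARIABLE_LENGTH;
            }
            if (first) {
                expected = segment.fixedLength;
                first = false;
            } else if (segment.fixedLength != expected) {
                return Format.VARIABLE_LENGTH;
            }
        }
        return Format.FIXED_LENGTH;
    }
}
```

Note that under this rule two fixed-length segments with different value lengths (say 4 and 8 bytes) merge into a variable-length list, which matches the "with the same length" condition in the description.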