"A better way to do this is using payloads. By creating a "special" posting list that has one posting with payload for each document you can "simulate" a column- stride field. The performance is significantly better compared to stored fields, however still not optimal. The reason is that for each document the freq value, which is in this particular case always 1, has to be decoded, also one position value, which is always 0, has to be loaded."
If we put this approach into the context of http://wiki.apache.org/jakarta-lucene/FlexibleIndexing, then one special case of it would remove the performance obstacles you have mentioned. Would it be easier to tackle these issues and have both problems fixed?

I am not very familiar with Lucene file formats, so please take this with a pinch of salt.

----- Original Message ----
From: Michael Busch (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Friday, 14 March, 2008 7:57:24 AM
Subject: [jira] Created: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

Column-stride fields (aka per-document Payloads)
------------------------------------------------

Key: LUCENE-1231
URL: https://issues.apache.org/jira/browse/LUCENE-1231
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.4

This new feature has been proposed and discussed here:
http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results

Currently it is possible in Lucene to store data as stored fields or as payloads. Stored fields provide good performance if you want to load all fields for one document, because this is a sequential I/O operation.

If, however, you want to load the data from one field for a large number of documents, then stored fields perform quite badly, because lots of I/O seeks might have to be performed.

A better way to do this is using payloads. By creating a "special" posting list that has one posting with payload for each document you can "simulate" a column-stride field. The performance is significantly better compared to stored fields, however still not optimal. The reason is that for each document the freq value, which is in this particular case always 1, has to be decoded, also one position value, which is always 0, has to be loaded.

As a solution we want to add real column-stride fields to Lucene.
A possible format for the new data structure could look like this (CSD stands for column-stride data; once we decide on a final name for this feature we can change this):

CSDList --> FixedLengthList | <VariableLengthList, SkipList>
FixedLengthList --> <Payload>^SegSize
VariableLengthList --> <DocDelta, PayloadLength?, Payload>
Payload --> Byte^PayloadLength
PayloadLength --> VInt
SkipList --> see frq.file

We distinguish here between the fixed length and the variable length cases. To allow flexibility, Lucene could automatically pick the "right" data structure. This could work like this: When the DocumentsWriter writes a segment it checks whether all values of a field have the same length. If yes, it stores them as FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger merges two or more segments it checks if all segments have a FixedLengthList with the same length for a column-stride field. If not, it writes a VariableLengthList to the new segment.

Once this feature is implemented, we should think about making the column-stride fields updateable, similar to the norms. This will be a very powerful feature that can for example be used for low-latency tagging of documents.

Other use cases:
- replace norms
- allow boost values to be stored separately from norms
- as input for the FieldCache, thus providing significantly improved loading performance (see LUCENE-831)

Things that need to be done here:
- Decide on a name for this feature :) - I think "column-stride fields" was liked better than "per-document payloads"
- Design an API for this feature. We should keep in mind here that these fields are supposed to be updateable.
- Define data structures.

I would like to get this feature into 2.4. Feedback about the open questions is very welcome so that we can finalize the design soon and start implementing.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
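
For readers following the thread, the fixed-versus-variable selection described in the proposal might be sketched roughly as below. This is an illustration only, written in Python rather than against any Lucene API: the names `write_vint`, `encode_segment` and `merged_kind` are hypothetical, the skip list is left out, PayloadLength is always written (the proposal marks it optional), and a document without a value is assumed to force the variable-length form.

```python
from typing import List, Optional, Tuple


def write_vint(n: int) -> bytes:
    """Encode a non-negative int 7 bits per byte, low-order bits first,
    with the high bit set on every byte except the last."""
    if n < 0:
        raise ValueError("VInt must be non-negative")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_segment(values: List[Optional[bytes]]) -> Tuple[str, bytes]:
    """values[doc_id] is the payload of that document, or None if it has none.

    Returns ("fixed", data) when every document has a payload of the same
    length, otherwise ("variable", data).
    """
    lengths = {len(v) for v in values if v is not None}
    if values and None not in values and len(lengths) == 1:
        # FixedLengthList --> <Payload>^SegSize
        return "fixed", b"".join(values)

    # VariableLengthList --> <DocDelta, PayloadLength, Payload> per document
    out = bytearray()
    last_doc = 0
    for doc_id, payload in enumerate(values):
        if payload is None:
            continue
        out += write_vint(doc_id - last_doc)  # DocDelta
        out += write_vint(len(payload))       # PayloadLength (always written here)
        out += payload
        last_doc = doc_id
    return "variable", bytes(out)


def merged_kind(segments: List[Tuple[str, int]]) -> str:
    """segments holds (kind, payload_length) for each input segment;
    the length is ignored for "variable" segments."""
    kinds = {kind for kind, _ in segments}
    fixed_lengths = {length for kind, length in segments if kind == "fixed"}
    if kinds == {"fixed"} and len(fixed_lengths) == 1:
        return "fixed"
    return "variable"
```

In this sketch the fixed-length form needs no per-document decoding at all: the payload of document `d` with payload length `n` sits at byte offset `d * n`, which is the saving over the payload-based workaround that the issue is after.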
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]