"A better way to do this is using payloads. By creating a "special" posting list
that has one posting with payload for each document you can "simulate" a column-
stride field. The performance is significantly better compared to stored fields,
however still not optimal. The reason is that for each document the freq value,
which is in this particular case always 1, has to be decoded, also one position
value, which is always 0, has to be loaded."

If we put this approach into the context of 
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing, then one 
special case of it would remove the performance obstacles you have mentioned. 
Would it be easier to tackle these issues there and have both problems fixed?
I am not very familiar with the Lucene file formats, so please take this with a 
pinch of salt.

----- Original Message ----
From: Michael Busch (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Friday, 14 March, 2008 7:57:24 AM
Subject: [jira] Created: (LUCENE-1231) Column-stride fields (aka per-document 
Payloads)

Column-stride fields (aka per-document Payloads)
------------------------------------------------

                 Key: LUCENE-1231
                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
            Reporter: Michael Busch
            Assignee: Michael Busch
            Priority: Minor
             Fix For: 2.4


This new feature has been proposed and discussed here:
http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results

Currently it is possible in Lucene to store data as stored fields or as 
payloads.
Stored fields provide good performance if you want to load all fields for one
document, because this is a sequential I/O operation.

If, however, you want to load the data from one field for a large number of 
documents, then stored fields perform quite badly, because lots of I/O seeks 
might have to be performed. 

A better way to do this is using payloads. By creating a "special" posting list
that has one posting with payload for each document you can "simulate" a column-
stride field. The performance is significantly better compared to stored fields,
however still not optimal. The reason is that for each document the freq value,
which is in this particular case always 1, has to be decoded, also one position
value, which is always 0, has to be loaded.

As a solution we want to add real column-stride fields to Lucene. A possible
format for the new data structure could look like this (CSD stands for
column-stride data; once we decide on a final name for this feature we can
change this):

CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
FixedLengthList --> <Payload>^SegSize 
VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
Payload --> Byte^PayloadLength 
PayloadLength --> VInt 
SkipList --> see frq.file
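
To make the grammar above more concrete, here is a rough, self-contained sketch
of how the two layouts might be written and read. It is not Lucene code: the
class CsdFormatSketch and all of its method names are made up for illustration.
It assumes the usual Lucene VInt encoding (7 bits per byte, low-order bits
first), always writes PayloadLength even though the grammar marks it as
optional, and leaves out the SkipList entirely.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch of the two proposed CSD layouts; not part of Lucene. */
final class CsdFormatSketch {

    /** Writes a non-negative int as a VInt; a set high bit means more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] data, int[] pos) {
        byte b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** FixedLengthList: one payload per document, stored back to back. */
    static byte[] encodeFixed(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] payload : payloads) {
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    /** Fixed-length lookup is plain offset arithmetic, no decoding needed. */
    static byte[] readFixed(byte[] data, int payloadLength, int docId) {
        int start = docId * payloadLength;
        return Arrays.copyOfRange(data, start, start + payloadLength);
    }

    /** VariableLengthList: DocDelta, PayloadLength, Payload for each entry. */
    static byte[] encodeVariable(int[] docIds, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previousDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            writeVInt(out, docIds[i] - previousDoc); // DocDelta
            writeVInt(out, payloads[i].length);      // PayloadLength (always written here)
            out.write(payloads[i], 0, payloads[i].length);
            previousDoc = docIds[i];
        }
        return out.toByteArray();
    }

    /** Sequential scan (no skip list); returns null if the document has no entry. */
    static byte[] readVariable(byte[] data, int targetDoc) {
        int[] pos = {0};
        int doc = 0;
        while (pos[0] < data.length) {
            doc += readVInt(data, pos);
            int length = readVInt(data, pos);
            if (doc == targetDoc) {
                return Arrays.copyOfRange(data, pos[0], pos[0] + length);
            }
            if (doc > targetDoc) {
                return null; // entries are sorted by document, so we passed it
            }
            pos[0] += length;
        }
        return null;
    }
}
```

In this sketch the fixed-length case needs neither freq nor position decoding,
and a lookup is a single multiplication, whereas the variable-length case has
to walk the entries, which is where a skip list would help.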

We distinguish here between the fixed length and the variable length cases. To
allow flexibility, Lucene could automatically pick the "right" data structure. 
This could work like this: When the DocumentsWriter writes a segment it checks 
whether all values of a field have the same length. If yes, it stores them as 
FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger 
merges two or more segments it checks if all segments have a FixedLengthList 
with the same length for a column-stride field. If not, it writes a 
VariableLengthList to the new segment. 
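
The selection rule described above could be sketched roughly as follows. This
is only an illustration, not DocumentsWriter or SegmentMerger code: the class
CsdLayoutChooser, its methods, and the VARIABLE marker are hypothetical, and
the sketch assumes that every document in a segment has a value and that an
empty segment falls back to the variable-length layout.

```java
import java.util.List;

/** Hypothetical helper that picks between the two CSD layouts; not Lucene code. */
final class CsdLayoutChooser {

    /** Marker meaning "write a VariableLengthList". */
    static final int VARIABLE = -1;

    /**
     * Segment flush: returns the shared payload length if all values have the
     * same length (FixedLengthList), otherwise VARIABLE.
     */
    static int chooseForSegment(List<byte[]> values) {
        if (values.isEmpty()) {
            return VARIABLE; // assumption: nothing to size a fixed list by
        }
        int length = values.get(0).length;
        for (byte[] value : values) {
            if (value.length != length) {
                return VARIABLE;
            }
        }
        return length;
    }

    /**
     * Segment merge: the merged segment stays fixed-length only if every input
     * segment is fixed-length with the same payload length.
     */
    static int chooseForMerge(int[] segmentLayouts) {
        if (segmentLayouts.length == 0) {
            return VARIABLE;
        }
        int first = segmentLayouts[0];
        for (int layout : segmentLayouts) {
            if (layout == VARIABLE || layout != first) {
                return VARIABLE;
            }
        }
        return first;
    }
}
```

For example, merging two segments that both use 4-byte values would keep the
fixed layout, while merging a 4-byte segment with an 8-byte one would fall back
to the variable-length layout.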

Once this feature is implemented, we should think about making the column-
stride fields updateable, similar to the norms. This will be a very powerful
feature that can for example be used for low-latency tagging of documents.

Other use cases:
- replace norms
- allow storing boost values separately from norms
- as input for the FieldCache, thus providing significantly improved loading
performance (see LUCENE-831)

Things that need to be done here:
- Decide on a name for this feature :) - I think "column-stride fields" was
liked better than "per-document payloads".
- Design an API for this feature. We should keep in mind here that these 
fields are supposed to be updateable.
- Define data structures.

I would like to get this feature into 2.4. Feedback about the open questions
is very welcome so that we can finalize the design soon and start 
implementing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





