[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

Michael McCandless (JIRA) Sat, 04 Apr 2009 05:15:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695726#action_12695726
 ]


Michael McCandless commented on LUCENE-1231:
--------------------------------------------

{quote}
Eventually we need more flexibility to utilize the flexible indexing
chain anyway. We need to store which codec to use for a field. Then we
could also just make a new codec for column-stride fields and maybe
then we do not have to introduce a new Field API.
{quote}

By creating a custom indexing chain you could actually write CSF,
today.

But the lack of extensibility of Field needs to be addressed: you need
some way to store something arbitrary & opaque into a field such that
your indexing chain could pick it up and act.

And FieldInfos also needs a "store this opaque thing for me" API.

One of the big changes in LUCENE-1458 is to strongly separate
different fields on the read APIs.  EG there is a separate FieldsEnum
from TermsEnum, meaning you first seek to the field you want, then
request a TermsEnum from that, which can iterate through the terms
only for that field.  It's the codec's job to return the right
TermsEnum for a given field.
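
To make the field-then-terms separation concrete, here is a minimal in-memory
sketch. TermsCursor, FieldsCursor, InMemoryFields and FieldTerms are
hypothetical stand-ins invented for illustration; they are not the
LUCENE-1458 classes, and this sketch walks the fields with next() rather than
seeking.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;

// Hypothetical stand-in for a per-field terms enumerator.
interface TermsCursor {
    String next(); // next term of this field, or null when exhausted
}

// Hypothetical stand-in for a fields enumerator.
interface FieldsCursor {
    String next();       // next field name, or null when exhausted
    TermsCursor terms(); // terms of the current field only
}

// Toy "codec": hands out a terms cursor scoped to a single field.
final class InMemoryFields implements FieldsCursor {
    private final Iterator<Map.Entry<String, SortedSet<String>>> fields;
    private SortedSet<String> current;

    InMemoryFields(SortedMap<String, SortedSet<String>> index) {
        this.fields = index.entrySet().iterator();
    }

    public String next() {
        if (!fields.hasNext()) {
            current = null;
            return null;
        }
        Map.Entry<String, SortedSet<String>> e = fields.next();
        current = e.getValue();
        return e.getKey();
    }

    public TermsCursor terms() {
        // Only valid after next() has returned a field name.
        final Iterator<String> it = current.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }
}

final class FieldTerms {
    // Position on the wanted field first, then iterate only that field's terms.
    static List<String> termsOf(FieldsCursor fields, String wanted) {
        List<String> out = new ArrayList<>();
        String field;
        while ((field = fields.next()) != null) {
            if (field.equals(wanted)) {
                TermsCursor terms = fields.terms();
                String term;
                while ((term = terms.next()) != null) {
                    out.add(term);
                }
                break;
            }
        }
        return out;
    }
}
```

The point of the shape is that the caller never sees terms from another field,
so a codec is free to store each field differently.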

Not to delay 2.9 further, but... I also wonder if Lucene had
NumericField (say), how it would simplify things here.  EG, today, if
I have a field "weight" that is a float, I'm going to have to set
something to tell the CSF (man the similarity of that to CFS is going
to cause problems!) writer to cast-it-and-save-it-as-float-array to
disk; I'm going to have to tell the TrieRangeUtil to do the same, etc.
It'd be much better if that field stored a float (not String), and if
it defaulted "naturally" to using these two special indexers...

{quote}
DataIn(Out)put would implement the different read and
write methods, whereas IndexIn(Out)put would only implement methods
like close(), seek(), getFilePointer(), length(), flush(), etc.
{quote}

What is the fastest way in Java to slurp in a bunch of bytes as an
int[], short[], float[], etc?  Seems that we need to answer that first
and then work out how to fix our store APIs to enable that.  (Maybe
it's IntBuffer wrapping ByteBuffer, instead of an int[]?).
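
As a starting point for that question, here is a small sketch of the two
obvious candidates using only java.nio: a per-int decode loop, and an
IntBuffer view over a ByteBuffer with a bulk get(). BulkIntDecode is a
hypothetical name, and big-endian byte order is an assumption of this sketch.
It only shows that the two approaches agree; which one is faster would have
to be benchmarked.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

final class BulkIntDecode {
    // Decode big-endian ints one at a time with shifts and masks.
    static int[] decodeLoop(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        for (int i = 0, j = 0; i < out.length; i++, j += 4) {
            out[i] = ((bytes[j] & 0xFF) << 24)
                   | ((bytes[j + 1] & 0xFF) << 16)
                   | ((bytes[j + 2] & 0xFF) << 8)
                   |  (bytes[j + 3] & 0xFF);
        }
        return out;
    }

    // View the same bytes as an IntBuffer and copy them out in one bulk call.
    static int[] decodeBulk(byte[] bytes) {
        IntBuffer view = ByteBuffer.wrap(bytes)
                                   .order(ByteOrder.BIG_ENDIAN)
                                   .asIntBuffer();
        int[] out = new int[view.remaining()];
        view.get(out);
        return out;
    }
}
```

If callers can consume the IntBuffer view directly, the copy into an int[]
can be skipped altogether.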

{quote}
The danger here compared to the current
payloads API would be that the user might read too few or too many
bytes of a CSF, which would result in an undefined and possibly hard
to debug behavior.
{quote}

I think it's better to have good performance with added risk of
danger, than forced handholding always.

{quote}
The SafeAccessor would count for you the number of read bytes and
throw exceptions if you don't consume the number of bytes you should
consume.
{quote}

I generally prefer liberal use of asserts to trip bugs like this,
instead of explicit strongly divorced code paths / classes / modes
etc., containing real if statements at production runtime.
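
A rough sketch of what the assert-based variant could look like.
CsfValueReader and its methods are hypothetical names, not an existing Lucene
class. The checks are skipped unless the JVM is started with -ea, so there is
a single code path and no separate "safe" mode.

```java
final class CsfValueReader {
    private final byte[] data;
    private int pos;
    private int end;

    CsfValueReader(byte[] data) {
        this.data = data;
    }

    // Position the reader on one value occupying [offset, offset + length).
    void startValue(int offset, int length) {
        pos = offset;
        end = offset + length;
    }

    byte readByte() {
        // Trips only with -ea: the caller read more bytes than the value holds.
        assert pos < end : "read past end of value";
        return data[pos++];
    }

    void finishValue() {
        // Trips only with -ea: the caller left part of the value unread.
        assert pos == end : (end - pos) + " bytes of the value left unread";
    }
}
```

Running the test suite with -ea then catches the too-few / too-many bytes
bugs, at the price of no protection when assertions are off.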


> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If you however want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a
> column-stride field. The performance is significantly better compared to stored
> fields, however still not optimal. The reason is that for each document the freq
> value, which is in this particular case always 1, has to be decoded, also one
> position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for column-
> stride data, once we decide for a final name for this feature we can change 
> this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure.
> This could work like this: When the DocumentsWriter writes a segment it checks
> whether all values of a field have the same length. If yes, it stores them as
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger
> merges two or more segments it checks if all segments have a FixedLengthList
> with the same length for a column-stride field. If not, it writes a
> VariableLengthList to the new segment.
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow to store boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide for a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these 
> fields are supposed to be updateable.
> - Define datastructures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.
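
The fixed-versus-variable choice described in the issue text above can be
sketched roughly as follows. CsdLayoutChooser and the VARIABLE sentinel are
hypothetical names, and treating an empty input as variable-length is an
assumption of this sketch, not something the proposal specifies.

```java
final class CsdLayoutChooser {
    // Sentinel meaning "no common length, use a VariableLengthList".
    static final int VARIABLE = -1;

    // At flush time: the common byte length of all values of a field,
    // or VARIABLE if the lengths differ (or there are no values).
    static int fixedLength(byte[][] values) {
        if (values.length == 0) {
            return VARIABLE;
        }
        int len = values[0].length;
        for (byte[] v : values) {
            if (v.length != len) {
                return VARIABLE;
            }
        }
        return len;
    }

    // At merge time: stay fixed-length only if every segment is
    // fixed-length with the same length; otherwise fall back to VARIABLE.
    static int mergedLength(int[] segmentLengths) {
        if (segmentLengths.length == 0) {
            return VARIABLE;
        }
        int len = segmentLengths[0];
        for (int l : segmentLengths) {
            if (l == VARIABLE || l != len) {
                return VARIABLE;
            }
        }
        return len;
    }
}
```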

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
