[jira] Commented: (LUCENE-2125) Ability to store and retrieve attributes in the inverted index

Michael McCandless (JIRA) Mon, 07 Dec 2009 02:45:44 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786856#action_12786856 ]


Michael McCandless commented on LUCENE-2125:
--------------------------------------------

{quote}
bq. I wonder if we need to allow codecs to store data into SegmentInfo/FieldInfo for this (we don't now).

IMO we definitely do. E.g. for backwards-compatibility: if users switch the
encoding of an attribute, then they need a way to determine in which format
it is stored in a given segment.

And we need to open up FieldInfo too: it has to record which attributes are
stored, and in what order.

I'm sure these are the things you had in mind too?
{quote}

Well... some stuff should be written into the header of each file, so eg a 
switch of encoding could be handled by the simple versioning the Codec API 
gives you (Codec.writeHeader/Codec.checkHeader).
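
To make the versioning idea concrete, here is a rough sketch of how a per-file header lets a reader pick the right decoder after an encoding change. It uses plain java.io rather than the flex Codec API, and the class name, codec name, version constants and the two value encodings are all made up for illustration; only the writeHeader/checkHeader naming follows the methods mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class HeaderVersioningSketch {
    // Hypothetical codec name and versions; not taken from the flex branch.
    static final String CODEC_NAME = "MyAttrCodec";
    static final int VERSION_START = 0;         // value stored as a 4-byte int
    static final int VERSION_NEW_ENCODING = 1;  // value stored as a 2-byte short
    static final int VERSION_CURRENT = VERSION_NEW_ENCODING;

    // Writes the codec name and format version at the start of a file.
    static void writeHeader(DataOutputStream out, String name, int version) throws IOException {
        out.writeUTF(name);
        out.writeInt(version);
    }

    // Verifies the codec name, checks the version range, and returns the version found.
    static int checkHeader(DataInputStream in, String name, int minVersion, int maxVersion)
            throws IOException {
        String actual = in.readUTF();
        if (!actual.equals(name)) {
            throw new IOException("codec mismatch: expected " + name + " but got " + actual);
        }
        int version = in.readInt();
        if (version < minVersion || version > maxVersion) {
            throw new IOException("unsupported version: " + version);
        }
        return version;
    }

    // Writes one value using the encoding that belongs to the given version.
    static byte[] writeFile(int version, int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeHeader(out, CODEC_NAME, version);
        if (version >= VERSION_NEW_ENCODING) {
            out.writeShort(value);
        } else {
            out.writeInt(value);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads the value back, choosing the decoder from the version in the header.
    static int readFile(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        int version = checkHeader(in, CODEC_NAME, VERSION_START, VERSION_CURRENT);
        return version >= VERSION_NEW_ENCODING ? in.readShort() : in.readInt();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readFile(writeFile(VERSION_START, 1234)));
        System.out.println(readFile(writeFile(VERSION_NEW_ENCODING, 1234)));
    }
}
```

The point is only that old and new segments can coexist: the reader never has to be told which format a file is in, because the file says so itself.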

But, yeah, for other stuff I've been assuming we need to open up 
SegmentInfo/FieldInfo.

So eg "omitTermFreqAndPositions" is something we could conceivably put under 
codec control, ie, Lucene core shouldn't need to know this attr even exists.  
But, then we'd need extensibility of Field as well.  We've discussed splitting 
this setting, to separately control whether the freq is written and whether the 
positions are written, which makes complete sense.  It'd be great if such a 
change could be cleanly handled by simply creating a new version of the codec.  
Likewise, "hasProx", which is derived from the omitTFAPs of all fields within 
the segment, should be computed/managed entirely within the codec.
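
As a rough illustration of what "computed entirely within the codec" could mean, the sketch below derives hasProx from per-field flags, with omitTermFreqAndPositions already split into two separate settings. The FieldFlags class and its field names are hypothetical and do not exist in Lucene; this only shows the derivation, not where a codec would persist it.

```java
import java.util.Arrays;
import java.util.List;

final class HasProxSketch {
    /** Hypothetical per-field settings a codec could own instead of Lucene core. */
    static final class FieldFlags {
        final String name;
        final boolean omitTermFreq;   // true: do not write term frequencies
        final boolean omitPositions;  // true: do not write positions

        FieldFlags(String name, boolean omitTermFreq, boolean omitPositions) {
            this.name = name;
            this.omitTermFreq = omitTermFreq;
            this.omitPositions = omitPositions;
        }
    }

    // The segment has prox data as soon as one field writes positions.
    static boolean hasProx(List<FieldFlags> fields) {
        for (FieldFlags field : fields) {
            if (!field.omitPositions) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<FieldFlags> fields = Arrays.asList(
            new FieldFlags("id", true, true),
            new FieldFlags("body", false, false));
        System.out.println("hasProx=" + hasProx(fields));
    }
}
```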

> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>
>                 Key: LUCENE-2125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2125
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Flex Branch
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of a
> TokenAttributeImpl if its storeInIndex() returns true.
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in the subclasses.
> Currently the payload concept is hardcoded in 
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the 
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence 
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value. 
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more 
> efficient to e.g. define an attribute that writes and reads a single 
> VInt than storing that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other 
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
> After this part is done I'd like to use a very similar approach for
> column-stride fields.
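
For reference, the serialize/deserialize/storeInIndex proposal in the description above might look roughly like the sketch below. It is only an illustration under several assumptions: it uses the JDK's java.io.DataOutput/DataInput interfaces as stand-ins for the proposed Lucene classes, hand-rolls the VInt helpers (7 data bits per byte, high bit as continuation flag), and WeightAttributeImpl and the index() helper are invented names. The three abstract methods mirror the ones listed in the description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

final class StoredAttributeSketch {
    // Variable-length int: low-order 7 bits first, high bit set while more bytes follow.
    static void writeVInt(DataOutput out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.writeByte((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.writeByte(i);
    }

    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    /** Stand-in for the proposed base class; only the three new methods are modeled. */
    abstract static class TokenAttributeImpl {
        abstract void serialize(DataOutput out) throws IOException;
        abstract void deserialize(DataInput in) throws IOException;
        abstract boolean storeInIndex();
    }

    /** Hypothetical user-defined attribute holding a single int. */
    static final class WeightAttributeImpl extends TokenAttributeImpl {
        int weight;

        @Override void serialize(DataOutput out) throws IOException {
            writeVInt(out, weight);    // no intermediate byte[] as with a payload
        }

        @Override void deserialize(DataInput in) throws IOException {
            weight = readVInt(in);
        }

        @Override boolean storeInIndex() {
            return true;
        }
    }

    // What the indexer would do per position: serialize only if the attribute asks for it.
    static byte[] index(TokenAttributeImpl att) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        if (att.storeInIndex()) {
            att.serialize(out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        WeightAttributeImpl written = new WeightAttributeImpl();
        written.weight = 300;
        byte[] posting = index(written);

        WeightAttributeImpl read = new WeightAttributeImpl();
        read.deserialize(new DataInputStream(new ByteArrayInputStream(posting)));
        System.out.println(read.weight + " stored in " + posting.length + " byte(s)");
    }
}
```

Everything needed to get the attribute into and out of the index lives in the one attribute class, which is the ease-of-use argument made in the description.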

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


