[jira] Commented: (LUCENE-2125) Ability to store and retrieve attributes in the inverted index

Michael Busch (JIRA) Mon, 07 Dec 2009 02:17:45 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786848#action_12786848
 ]


Michael Busch commented on LUCENE-2125:
---------------------------------------

{quote}
So you'd remove the explicit payload methods in PositionsEnum? Ie,
users on migrating to flex would have to switch to the payloads
attribute?
{quote}

I think that would make sense? Payloads don't have to be treated specially
anymore if any attribute can be stored in the posting lists.

{quote}
Note that the preflex codec only has a reader (FieldsProducer), not a
writer. Ie you can read the old index format but not write it.
{quote}

Hmm, so the concern is that people *have* to make the switch to the flex APIs
after upgrading to the next Lucene version if they want to create indexes with
good old payloads?

{quote}
Ideally the serialize/unserialize could efficiently handle the
fixed-length case without using up the 1 bit in the index.
{quote}

Yes!

{quote}
I wonder if we need to allow codecs to store data into
SegmentInfo/FieldInfo for this (we don't now).
{quote}

IMO we definitely do. E.g. for backwards-compatibility: if users switch the
encoding of an attribute, then they need a way to determine in which format it
is stored in a given segment.

And we need to open up FieldInfo too: it has to store which attributes are
stored and in what order.
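
To make the two points above concrete, here is a minimal sketch of the kind of record a segment would need per field. Everything in it is hypothetical: SketchFieldInfo, StoredAttribute and the integer formatVersion are illustration names only, not Lucene's FieldInfo or any existing codec API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-field, per-segment record; NOT Lucene's FieldInfo.
final class SketchFieldInfo {
    // One entry per stored attribute, kept in the order the attributes
    // are written for each position.
    static final class StoredAttribute {
        final String attributeName;
        final int formatVersion; // assumed: a codec-defined encoding version

        StoredAttribute(String attributeName, int formatVersion) {
            this.attributeName = attributeName;
            this.formatVersion = formatVersion;
        }
    }

    private final List<StoredAttribute> stored = new ArrayList<>();

    void add(String attributeName, int formatVersion) {
        stored.add(new StoredAttribute(attributeName, formatVersion));
    }

    // Index of the attribute in the write order, or -1 if it is not stored.
    int orderOf(String attributeName) {
        for (int i = 0; i < stored.size(); i++) {
            if (stored.get(i).attributeName.equals(attributeName)) {
                return i;
            }
        }
        return -1;
    }

    // Encoding version used in this segment, or -1 if it is not stored.
    int formatVersionOf(String attributeName) {
        int i = orderOf(attributeName);
        return i < 0 ? -1 : stored.get(i).formatVersion;
    }
}
```

A reader opening an old segment could then pick the matching decoder from formatVersionOf(...) instead of guessing the encoding.
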

I'm sure these are the things you had in mind too?


> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>
>                 Key: LUCENE-2125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2125
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Flex Branch
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of a
> TokenAttributeImpl in case its storeInIndex() returns true.
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
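
The three proposed methods could be sketched roughly as follows. This is only an illustration under stated assumptions: java.io.DataOutput and java.io.DataInput stand in for the proposed Lucene classes of the same name, and SketchTokenAttribute, BoostAttribute and SketchIndexer are hypothetical names that do not exist in Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical base class mirroring the three proposed methods.
// java.io.DataOutput/DataInput are stand-ins for the proposed Lucene classes.
abstract class SketchTokenAttribute {
    abstract void serialize(DataOutput out) throws IOException;
    abstract void deserialize(DataInput in) throws IOException;
    abstract boolean storeInIndex();
}

// Hypothetical user-defined attribute: everything needed to index it
// lives in this one class.
class BoostAttribute extends SketchTokenAttribute {
    float boost = 1.0f;

    @Override void serialize(DataOutput out) throws IOException { out.writeFloat(boost); }
    @Override void deserialize(DataInput in) throws IOException { boost = in.readFloat(); }
    @Override boolean storeInIndex() { return true; }
}

// Stand-in for the indexer: it knows nothing about specific attributes
// and only serializes those that opt in via storeInIndex().
final class SketchIndexer {
    static byte[] write(SketchTokenAttribute... attrs) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (SketchTokenAttribute a : attrs) {
                if (a.storeInIndex()) {
                    a.serialize(out);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void read(byte[] data, SketchTokenAttribute... attrs) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            for (SketchTokenAttribute a : attrs) {
                if (a.storeInIndex()) {
                    a.deserialize(in);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading must visit the attributes in the same order as writing, which is why the order has to be recorded somewhere per segment.
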
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in IndexOutput and IndexInput.
> Currently the payload concept is hardcoded in 
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the 
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence 
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
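
A small sketch of the length-compression idea described above. It is not the actual Lucene writer code: PayloadLengthSketch is a hypothetical name, and the sketch assumes the low bit of the shifted position delta flags "a new length follows", so a run of equal-length payloads pays for the length only once.

```java
import java.io.ByteArrayOutputStream;

final class PayloadLengthSketch {
    // Writes a non-negative int as 7-bit groups, low-order group first;
    // the high bit of each byte marks that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // positions must be ascending; payloads[i] belongs to positions[i].
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // forces the first length to be written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length == lastLength) {
                // Same length as before: low bit clear, no length written.
                writeVInt(out, delta << 1);
            } else {
                // Length changed: low bit set, then the new length.
                writeVInt(out, (delta << 1) | 1);
                writeVInt(out, length);
                lastLength = length;
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }
}
```

With three 2-byte payloads at positions 3, 7 and 12, only the first entry carries a length byte; the other two spend a single byte on the shifted delta.
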
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value. 
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more 
> efficient to e.g. define an attribute that writes and reads a single 
> VInt than storing that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other 
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
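
The difference between the two read paths can be illustrated with a small sketch. SketchInput is a hypothetical stand-in for the proposed DataInput (only the calls needed here), ReadPaths is an illustration name, and the payload layout of a length VInt followed by the bytes is an assumption made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-in for the proposed DataInput.
final class SketchInput {
    private final byte[] buf;
    private int pos;

    SketchInput(byte[] buf) { this.buf = buf; }

    byte readByte() { return buf[pos++]; }

    // Decodes 7-bit groups, low-order group first; the high bit of each
    // byte marks that another byte follows.
    int readVInt() {
        byte b = readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Copies the next 'length' bytes into a fresh array.
    byte[] readBytes(int length) {
        byte[] copy = Arrays.copyOfRange(buf, pos, pos + length);
        pos += length;
        return copy;
    }
}

final class ReadPaths {
    // Attribute path: decode straight from the posting stream.
    static int readAsAttribute(SketchInput in) {
        return in.readVInt();
    }

    // Payload path (assumed layout: length VInt, then the bytes):
    // the bytes are copied into a byte[] first, then decoded again.
    static int readAsPayload(SketchInput in) {
        int length = in.readVInt();
        byte[] payload = in.readBytes(length);
        return new SketchInput(payload).readVInt();
    }
}
```

Both paths return the same value; the payload path additionally allocates and copies a byte[] per position, which is the overhead the attribute path avoids.
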
> After this part is done I'd like to use a very similar approach for
> column-stride fields.

