Ability to store and retrieve attributes in the inverted index
--------------------------------------------------------------
Key: LUCENE-2125
URL: https://issues.apache.org/jira/browse/LUCENE-2125
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: Flex Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: Flex Branch

Now that we have the cool attribute-based TokenStream API and also the great new flexible indexing features, the next logical step is to allow storing the attributes inline in the posting lists. Currently this is only supported for the PayloadAttribute. The flex search APIs already provide an AttributeSource, so there will be a very clean and performant symmetry. It should be seamlessly possible for the user to define a new attribute, add it to the TokenStream, and then retrieve it from the flex search APIs.

What I'm planning to do is to add additional methods to the token attributes (e.g. by adding a new class TokenAttributeImpl, which extends AttributeImpl and is the super class of all impls in o.a.l.a.tokenattributes):

- void serialize(DataOutput)
- void deserialize(DataInput)
- boolean storeInIndex()

The indexer will only call the serialize method of a TokenAttributeImpl if its storeInIndex() returns true. The big advantage here is the ease of use: a user can implement in one place everything necessary to add the attribute to the index.

Btw: I'd like to introduce DataOutput and DataInput as super classes of IndexOutput and IndexInput. They will contain methods like readByte(), readVInt(), etc., but methods such as close(), getFilePointer() etc. will stay in the subclasses, IndexOutput and IndexInput.

Currently the payload concept is hardcoded in TermsHashPerField and FreqProxTermsWriterPerField. These classes take care of copying the contents of the PayloadAttribute over into the intermediate in-memory posting list representation and reading it again. Ideally these classes should not know about specific attributes, but only call serialize() on those attributes that shall be stored in the posting list.
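
As a rough illustration of the proposal so far, the following sketch shows the three methods and an indexer loop that serializes only attributes whose storeInIndex() returns true. Everything here is an assumption made for illustration: SketchTokenAttribute, WeightAttribute, ScratchAttribute and AttributeIndexerSketch are hypothetical names, not Lucene classes, and java.io.DataOutput / java.io.DataInput from the JDK stand in for the proposed Lucene classes of the same name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed TokenAttributeImpl (not a Lucene class).
abstract class SketchTokenAttribute {
    abstract void serialize(DataOutput out) throws IOException;
    abstract void deserialize(DataInput in) throws IOException;
    abstract boolean storeInIndex();
}

// Hypothetical user-defined attribute that asks to be stored in the posting list.
class WeightAttribute extends SketchTokenAttribute {
    int weight;
    @Override void serialize(DataOutput out) throws IOException { out.writeInt(weight); }
    @Override void deserialize(DataInput in) throws IOException { weight = in.readInt(); }
    @Override boolean storeInIndex() { return true; }
}

// Hypothetical attribute that is only used during analysis and is never stored.
class ScratchAttribute extends SketchTokenAttribute {
    int scratch;
    @Override void serialize(DataOutput out) throws IOException { out.writeInt(scratch); }
    @Override void deserialize(DataInput in) throws IOException { scratch = in.readInt(); }
    @Override boolean storeInIndex() { return false; }
}

class AttributeIndexerSketch {
    // Indexer side: knows nothing about concrete attributes, only the three methods.
    static byte[] writePosition(List<SketchTokenAttribute> attributes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (SketchTokenAttribute attribute : attributes) {
                if (attribute.storeInIndex()) {
                    attribute.serialize(out);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Search side: restore the stored attributes in the same order they were written.
    static void readPosition(byte[] data, List<SketchTokenAttribute> attributes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            for (SketchTokenAttribute attribute : attributes) {
                if (attribute.storeInIndex()) {
                    attribute.deserialize(in);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WeightAttribute weight = new WeightAttribute();
        weight.weight = 42;
        ScratchAttribute scratch = new ScratchAttribute();
        scratch.scratch = 7;
        List<SketchTokenAttribute> written = Arrays.asList(weight, scratch);
        byte[] data = writePosition(written);

        WeightAttribute restored = new WeightAttribute();
        List<SketchTokenAttribute> read = Arrays.asList(restored, new ScratchAttribute());
        readPosition(data, read);
        System.out.println("bytes written: " + data.length + ", weight read back: " + restored.weight);
    }
}
```

The point of the design is visible in writePosition(): the loop has no special case for any attribute type, which is what would let TermsHashPerField and FreqProxTermsWriterPerField drop their payload-specific code.
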

We also need to change the PositionsEnum and PositionsConsumer APIs to deal with attributes instead of payloads. I think the new codecs should all support storing attributes. Only the preflex one should be hardcoded to only take the PayloadAttribute into account.

We'll possibly need another extension point that allows us to influence compression across multiple postings. Today we use the length-compression trick for the payloads: if the previous payload had the same length as the current one, we don't store the length explicitly again, but only set a bit in the shifted position VInt. Since often all payloads of one posting list have the same length, this results in effective compression. Now an advanced user might want to implement a similar encoding, where it's not enough to just control serialization of a single value, but where e.g. the previous position can be taken into account to decide how to encode a value. I'm not sure yet what this extension point should look like. Maybe the flex APIs are actually already sufficient.

One major goal of this feature is performance: it ought to be more efficient to e.g. define an attribute that writes and reads a single VInt than to store that VInt as a payload. The payload has the overhead of converting the data into a byte array first. An attribute on the other hand should be able to call 'int value = dataInput.readVInt();' directly without the byte[] indirection.

After this part is done I'd like to use a very similar approach for column-stride fields.
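
To make the length-compression trick mentioned above concrete, here is a minimal sketch of one way to encode it. It is not the actual codec code: PayloadLengthSketch and its methods are hypothetical names, java.io streams stand in for IndexOutput/IndexInput, and the convention that a set low bit means "a new length follows" is an assumption of this sketch. The writeVInt/readVInt helpers use the usual 7-bits-per-byte variable-length encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration class, not part of Lucene.
class PayloadLengthSketch {
    // Variable-length int: 7 data bits per byte, high bit set while more bytes follow.
    static void writeVInt(DataOutput out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.writeByte((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.writeByte(i);
    }

    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Writes ascending positions as deltas, each followed by its payload bytes.
    static byte[] encode(int[] positions, byte[][] payloads) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int lastPosition = 0;
            int lastLength = -1; // forces the first length to be written
            for (int n = 0; n < positions.length; n++) {
                int delta = positions[n] - lastPosition;
                lastPosition = positions[n];
                int length = payloads[n].length;
                if (length != lastLength) {
                    writeVInt(out, (delta << 1) | 1); // low bit set: a length follows
                    writeVInt(out, length);
                    lastLength = length;
                } else {
                    writeVInt(out, delta << 1); // low bit clear: reuse the previous length
                }
                out.write(payloads[n]);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back `count` postings; positions are stored into positionsOut.
    static byte[][] decode(byte[] data, int count, int[] positionsOut) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte[][] payloads = new byte[count][];
            int position = 0;
            int length = 0;
            for (int n = 0; n < count; n++) {
                int code = readVInt(in);
                position += code >>> 1;
                if ((code & 1) != 0) {
                    length = readVInt(in);
                }
                payloads[n] = new byte[length];
                in.readFully(payloads[n]);
                positionsOut[n] = position;
            }
            return payloads;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] positions = {3, 10, 200};
        byte[][] payloads = {{1, 2}, {3, 4}, {5, 6}};
        byte[] data = encode(positions, payloads);
        int[] decodedPositions = new int[positions.length];
        decode(data, positions.length, decodedPositions);
        System.out.println("encoded size: " + data.length + " bytes, last position: " + decodedPositions[2]);
    }
}
```

Note that the encoder needs state from the previous posting (lastLength, lastPosition). A per-value serialize(DataOutput) call has no access to that state, which is why a separate extension point for cross-posting compression may be needed.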