[ https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842259#action_12842259 ]
Uwe Schindler commented on LUCENE-2125:
---------------------------------------

I would prefer not to extend AttributeImpl, but instead have the
attribute simply extend another interface, SerializableAttribute, that
provides input/output methods. DocInverter can then just check with
instanceof whether the attribute is to be stored in the index. This
would also help with ProxyAttributes (LUCENE-2154).

> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>
> Key: LUCENE-2125
> URL: https://issues.apache.org/jira/browse/LUCENE-2125
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: Flex Branch
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: Flex Branch
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
> The indexer will only call the serialize method of a
> TokenAttributeImpl in case its storeInIndex() returns true.
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
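For illustration, the interface-based alternative from the comment above could look roughly like the following. This is only a sketch under assumptions: SerializableAttribute does not exist in Lucene, the names BoostAttribute, DebugAttribute and AttributeWriterSketch are made up for this example, and java.io.DataOutput/DataInput stand in for the proposed Lucene classes of the same name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Hypothetical interface from the comment; not an existing Lucene API.
interface SerializableAttribute {
    void serialize(DataOutput out) throws IOException;
    void deserialize(DataInput in) throws IOException;
}

// Made-up attribute that opts in to being stored by implementing the interface.
final class BoostAttribute implements SerializableAttribute {
    int boost;

    @Override
    public void serialize(DataOutput out) throws IOException {
        out.writeInt(boost); // fixed 4 bytes; java.io has no VInt
    }

    @Override
    public void deserialize(DataInput in) throws IOException {
        boost = in.readInt();
    }
}

// Made-up attribute that does not implement the interface and is never written.
final class DebugAttribute {
    String note = "not stored";
}

final class AttributeWriterSketch {
    // Stand-in for the indexer-side check: no storeInIndex() flag is needed,
    // because instanceof alone decides whether an attribute is serialized.
    static byte[] writeStored(List<?> attributes) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (Object attribute : attributes) {
            if (attribute instanceof SerializableAttribute) {
                ((SerializableAttribute) attribute).serialize(out);
            }
        }
        out.flush();
        return bytes.toByteArray();
    }
}
```

Compared with the three-method proposal in the issue, the opt-in here is expressed by the type itself, so attributes that are never stored carry no extra methods.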
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in the subclasses.
> Currently the payload concept is hardcoded in
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
> We'll possibly need another extension point that allows us to influence
> compression across multiple postings. Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value.
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
> One major goal of this feature is performance: It ought to be more
> efficient to e.g. define an attribute that writes and reads a single
> VInt than storing that VInt as a payload.
> The payload has the overhead of converting the data into a byte array
> first. An attribute on the other hand should be able to call
> 'int value = dataInput.readVInt();' directly without the byte[]
> indirection.
> After this part is done I'd like to use a very similar approach for
> column-stride fields.
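The length-compression trick described in the quoted issue can be sketched as follows. This is a minimal, self-contained illustration, not the actual Lucene writer code: PayloadLengthSketch and its methods are made-up names, and the sketch assumes the low bit of the shifted position delta is set when the payload length changes (so the length follows), and cleared when the length is the same as for the previous position.

```java
import java.io.ByteArrayOutputStream;

final class PayloadLengthSketch {
    // Variable-length int: 7 data bits per byte, low-order bits first;
    // a set high bit means another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes positions (ascending) together with one payload per position.
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // forces the first length to be written
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length != lastLength) {
                // Length changed: flag it in the low bit, then write the length.
                writeVInt(out, (delta << 1) | 1);
                writeVInt(out, length);
                lastLength = length;
            } else {
                // Same length as before: only the shifted delta is written.
                writeVInt(out, delta << 1);
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }
}
```

When all payloads of a posting list have the same length, only the first position pays for the explicit length; every later position costs just the shifted delta plus the payload bytes. A user-defined encoding that looks at the previous position or value would need a comparable hook, which is the extension point the issue leaves open.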