[ https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786854#action_12786854 ]
Michael McCandless commented on LUCENE-2125:
--------------------------------------------

I think it makes sense to not treat payloads specially in flex, i.e., make it an attr.

{quote}
Hmm, so the concern is that people have to make the switch to the flex APIs after upgrading to the next Lucene version if they want to create indexes with good old payloads?
{quote}

Well, not really -- if you stick payloads into your tokens during analysis, presumably the standard (= default) codec would recognize the new payload attr, and store it like normal.

Then we'd cut over any existing queries that do interesting things with payloads (PayloadNearQuery/PayloadTermQuery) to the new API, and your custom Similarity would still be invoked?

It's only if you directly access the TermPositions payload API today that you'd have to migrate to the new API? But even then, flex does back-compat emulation, so a new index written with the standard codec could be accessed via the old API.

BTW, the attribute should probably include a "merge" operation, somehow, so that the merge case is efficient (simple byte[] copying instead of decode/encode).

> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>
>                 Key: LUCENE-2125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2125
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Flex Branch
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
>
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry.
> It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
>
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
>
> The indexer will only call the serialize method of a
> TokenAttributeImpl in case its storeInIndex() returns true.
>
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
>
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in IndexOutput and IndexInput.
>
> Currently the payload concept is hardcoded in
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the
> intermediate in-memory postinglist representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
>
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
>
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
>
> We'll possibly need another extension point that allows us to influence
> compression across multiple postings.
> Today we use the
> length-compression trick for the payloads: if the previous payload had
> the same length as the current one, we don't store the length
> explicitly again, but only set a bit in the shifted position VInt. Since
> often all payloads of one posting list have the same length, this
> results in effective compression.
>
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value.
>
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
>
> One major goal of this feature is performance: It ought to be more
> efficient to e.g. define an attribute that writes and reads a single
> VInt than to store that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
>
> After this part is done I'd like to use a very similar approach for
> column-stride fields.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
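
To make the serialize()/deserialize()/storeInIndex() proposal from the quoted issue description a bit more concrete, here is a minimal sketch of how an attribute holding a single int might write and read itself as a VInt without going through a byte[] payload. This is not Lucene code and not the author's design: StorableAttribute, WeightAttribute, writePostings and readPostings are hypothetical names, java.io.DataOutput and java.io.DataInput stand in for the DataOutput/DataInput classes the issue proposes, and the VInt helpers are hand-written here using the usual scheme of 7 data bits per byte with the high bit as a continuation flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Illustrative only: none of these types are Lucene classes.
final class AttributeSerializationSketch {

    /** Hypothetical shape of the three hooks proposed for TokenAttributeImpl. */
    interface StorableAttribute {
        void serialize(DataOutput out) throws IOException;
        void deserialize(DataInput in) throws IOException;
        boolean storeInIndex();
    }

    /** Hypothetical attribute carrying one int per position. */
    static final class WeightAttribute implements StorableAttribute {
        int weight;

        @Override public void serialize(DataOutput out) throws IOException {
            writeVInt(out, weight); // written directly, no intermediate byte[]
        }

        @Override public void deserialize(DataInput in) throws IOException {
            weight = readVInt(in);
        }

        @Override public boolean storeInIndex() {
            return true;
        }
    }

    /** Variable-length int: 7 data bits per byte, high bit set means more bytes follow. */
    static void writeVInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Plays the indexer's role: serializes the attribute only if it asks to be stored. */
    static byte[] writePostings(int[] weights) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            WeightAttribute att = new WeightAttribute();
            for (int w : weights) {
                att.weight = w;
                if (att.storeInIndex()) {
                    att.serialize(out);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Plays the search side's role: reads `count` values back through the attribute. */
    static int[] readPostings(byte[] data, int count) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            WeightAttribute att = new WeightAttribute();
            int[] result = new int[count];
            for (int i = 0; i < count; i++) {
                att.deserialize(in);
                result[i] = att.weight;
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] weights = {3, 127, 128, 300, 70000};
        byte[] data = writePostings(weights);
        System.out.println(data.length + " bytes: "
                + Arrays.toString(readPostings(data, weights.length)));
    }
}
```

The point of the sketch is only that the attribute owns both directions of its encoding in one place, which is the ease-of-use argument made in the description. Cross-posting tricks such as the payload length compression would still need whatever extra extension point the issue leaves open.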