[ https://issues.apache.org/jira/browse/LUCENE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786848#action_12786848 ]
Michael Busch commented on LUCENE-2125:
---------------------------------------

{quote}
So you'd remove the explicit payload methods in PositionsEnum? I.e., users migrating to flex would have to switch to the payloads attribute?
{quote}

I think that would make sense? Payloads don't have to be treated specially anymore if any attribute can be stored in the posting lists.

{quote}
Note that the preflex codec only has a reader (FieldsProducer), not a writer. I.e., you can read the old index format but not write it.
{quote}

Hmm, so the concern is that people *have* to make the switch to the flex APIs after upgrading to the next Lucene version if they want to create indexes with good old payloads?

{quote}
Ideally the serialize/unserialize could efficiently handle the fixed-length case without using up the 1 bit in the index.
{quote}

Yes!

{quote}
I wonder if we need to allow codecs to store data into SegmentInfo/FieldInfo for this (we don't now).
{quote}

IMO we definitely do. E.g. for backwards-compatibility: if users switch the encoding of an attribute, then they need a way to determine in which format it is stored in a given segment. And we need to open up FieldInfo too: it has to record which attributes are stored and in what order. I'm sure these are the things you had in mind too?

> Ability to store and retrieve attributes in the inverted index
> --------------------------------------------------------------
>
>                 Key: LUCENE-2125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2125
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: Flex Branch
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: Flex Branch
>
>
> Now that we have the cool attribute-based TokenStream API and also the
> great new flexible indexing features, the next logical step is to
> allow storing the attributes inline in the posting lists. Currently
> this is only supported for the PayloadAttribute.
>
> The flex search APIs already provide an AttributeSource, so there will
> be a very clean and performant symmetry. It should be seamlessly
> possible for the user to define a new attribute, add it to the
> TokenStream, and then retrieve it from the flex search APIs.
>
> What I'm planning to do is to add additional methods to the token
> attributes (e.g. by adding a new class TokenAttributeImpl, which
> extends AttributeImpl and is the super class of all impls in
> o.a.l.a.tokenattributes):
>
> - void serialize(DataOutput)
> - void deserialize(DataInput)
> - boolean storeInIndex()
>
> The indexer will only call the serialize method of a
> TokenAttributeImpl if its storeInIndex() returns true.
>
> The big advantage here is the ease-of-use: A user can implement in one
> place everything necessary to add the attribute to the index.
>
> Btw: I'd like to introduce DataOutput and DataInput as super classes
> of IndexOutput and IndexInput. They will contain methods like
> readByte(), readVInt(), etc., but methods such as close(),
> getFilePointer() etc. will stay in IndexOutput and IndexInput.
>
> Currently the payload concept is hardcoded in
> TermsHashPerField and FreqProxTermsWriterPerField. These classes take
> care of copying the contents of the PayloadAttribute over into the
> intermediate in-memory posting list representation and reading it
> again. Ideally these classes should not know about specific
> attributes, but only call serialize() on those attributes that shall
> be stored in the posting list.
>
> We also need to change the PositionsEnum and PositionsConsumer APIs to
> deal with attributes instead of payloads.
>
> I think the new codecs should all support storing attributes. Only the
> preflex one should be hardcoded to only take the PayloadAttribute into
> account.
>
> We'll possibly need another extension point that allows us to influence
> compression across multiple postings.
> Today we use the length-compression trick for the payloads: if the
> previous payload had the same length as the current one, we don't
> store the length explicitly again, but only set a bit in the shifted
> position VInt. Since often all payloads of one posting list have the
> same length, this results in effective compression.
>
> Now an advanced user might want to implement a similar encoding, where
> it's not enough to just control serialization of a single value, but
> where e.g. the previous position can be taken into account to decide
> how to encode a value.
>
> I'm not sure yet what this extension point should look like. Maybe the
> flex APIs are actually already sufficient.
>
> One major goal of this feature is performance: It ought to be more
> efficient to e.g. define an attribute that writes and reads a single
> VInt than storing that VInt as a payload. The payload has the overhead
> of converting the data into a byte array first. An attribute on the other
> hand should be able to call 'int value = dataInput.readVInt();' directly
> without the byte[] indirection.
>
> After this part is done I'd like to use a very similar approach for
> column-stride fields.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
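
For illustration only, here is a rough sketch of how the three proposed methods (serialize, deserialize, storeInIndex) might fit together for an attribute that stores a single VInt. Nothing below is the actual Lucene API: StorableAttribute, WeightAttribute, writeVInt/readVInt and the helper methods are hypothetical names, and java.io.DataOutput/DataInput are used as stand-ins for the DataOutput/DataInput classes proposed in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class AttributeSerializationSketch {

    /** Hypothetical stand-in for the methods proposed for TokenAttributeImpl. */
    interface StorableAttribute {
        void serialize(DataOutput out) throws IOException;
        void deserialize(DataInput in) throws IOException;
        boolean storeInIndex();
    }

    /** Writes an int in 7-bit groups, low-order group first; the high bit marks "more bytes follow". */
    static void writeVInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    /** Reads an int written by writeVInt. */
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Hypothetical attribute holding one int per position, with no byte[] payload in between. */
    static final class WeightAttribute implements StorableAttribute {
        int weight;

        public void serialize(DataOutput out) throws IOException {
            writeVInt(out, weight);
        }

        public void deserialize(DataInput in) throws IOException {
            weight = readVInt(in);
        }

        public boolean storeInIndex() {
            return true;
        }
    }

    /** Plays the role of the indexer: serialize is only called if storeInIndex() returns true. */
    static byte[] write(StorableAttribute att) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            if (att.storeInIndex()) {
                att.serialize(new DataOutputStream(bytes));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a weight and reads it back through a fresh attribute instance. */
    static int roundTrip(int weight) {
        try {
            WeightAttribute written = new WeightAttribute();
            written.weight = weight;
            byte[] stored = write(written);
            WeightAttribute read = new WeightAttribute();
            read.deserialize(new DataInputStream(new ByteArrayInputStream(stored)));
            return read.weight;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(300));
    }
}
```

The point of the sketch is that the attribute owns both directions of its encoding in one place, and the caller only consults storeInIndex() before calling serialize(). How a real codec would locate and order the stored attributes is left open here, as it is in the discussion above.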