Re: Attributes, DocConsumer, Flexible Indexing, etc.

Grant Ingersoll Wed, 05 Aug 2009 14:55:51 -0700


On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:

On 8/5/09 1:07 PM, Grant Ingersoll wrote:
Hmmm, OK.
Random, somewhat uneducated thought: Why not just define the codecs to create byte arrays? Then we can use the existing payload capability much like I do with the DelimitedPayloadTokenFilter. We'd probably have to make sure this still worked with Similarity, but it seems like it could. Thinking on this some more, seems like this could work already with an AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter (I know, horrible name). Then, on the Query side, the AttributeTermQuery is just a glorified BoostingTermQuery with some callback hooks for dealing with the Attribute (but maybe that isn't even needed), either that or we just provide helper methods to the Similarity class so that people can easily decode the byte array into an Attribute. In fact, maybe all that needs to happen is the Attributes need to define encode/decode methods that (de)serialize a byte array.
Seems like this approach would require very little in the way of changes to Lucene, but I admit it isn't fully baked in my mind just yet. It also has the nice benefit that all the work we did on Payloads isn't wasted.
This is resonating more and more with me.  What do you think?
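To make the encode/decode idea concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: WeightedTagAttribute, its two fields, and the 6-byte layout are all hypothetical, chosen only to show an attribute that can (de)serialize itself to a byte array that could then be stored as payload bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical attribute, NOT a Lucene class: a per-token tag id plus a weight.
final class WeightedTagAttribute {
    static final int ENCODED_LENGTH = Short.BYTES + Float.BYTES; // 6 bytes

    final short tagId;
    final float weight;

    WeightedTagAttribute(short tagId, float weight) {
        this.tagId = tagId;
        this.weight = weight;
    }

    // Serialize to a fixed-size, big-endian byte array.
    byte[] encode() {
        return ByteBuffer.allocate(ENCODED_LENGTH)
                .putShort(tagId)
                .putFloat(weight)
                .array();
    }

    // Rebuild the attribute from bytes starting at the given offset.
    static WeightedTagAttribute decode(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, ENCODED_LENGTH);
        return new WeightedTagAttribute(buf.getShort(), buf.getFloat());
    }

    public static void main(String[] args) {
        byte[] bytes = new WeightedTagAttribute((short) 7, 1.5f).encode();
        WeightedTagAttribute back = decode(bytes, 0);
        System.out.println(back.tagId + " " + back.weight);
    }
}
```

A token filter on the indexing side would call something like encode(), and a helper on the search side would call something like decode(); how that would be wired into Similarity is left open here, as it is in the proposal.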
Well I think this would be a nice way of using the payloads better.
However, the idea behind flexible indexing is that you can customize the on-disk encoding in a way that it is as efficient as it can be for your particular use case. E.g. for payloads we currently have to encode the length. An application might not have to do that if it knows exactly what is stored.
Then there's only the Payload API that returns you a byte array. It basically copies the contents of the IndexInput (usually a BufferedIndexInput, which means array copy from the byte buffer to the payload byte array). If the application knows exactly what is stored it can read/decode it more efficiently.
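The point about encoding the length can be illustrated with a toy layout. The sketch below is not Lucene's on-disk format; PayloadLayoutSketch, the one-byte length prefix, and the choice of a float value per token are all assumptions made for illustration. It only contrasts a generic "length, then bytes" layout with a fixed-width layout that an application could use when it knows every value has the same size.

```java
import java.nio.ByteBuffer;

// Toy comparison of two layouts; names and formats are hypothetical.
final class PayloadLayoutSketch {

    // Generic layout: every value is preceded by a one-byte length.
    static byte[] writeLengthPrefixed(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * (1 + Float.BYTES));
        for (float v : values) {
            buf.put((byte) Float.BYTES); // length prefix
            buf.putFloat(v);             // the value itself
        }
        return buf.array();
    }

    // Application-specific layout: the size is known, so no length is stored.
    static byte[] writeFixedWidth(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // With a fixed width, the i-th value sits at a computable offset.
    static float readFixedWidth(byte[] data, int index) {
        return ByteBuffer.wrap(data).getFloat(index * Float.BYTES);
    }

    public static void main(String[] args) {
        float[] values = {1f, 2f, 3f};
        System.out.println("length-prefixed bytes: " + writeLengthPrefixed(values).length);
        System.out.println("fixed-width bytes:     " + writeFixedWidth(values).length);
    }
}
```

How much the prefix actually costs in a real index depends on how the length is encoded, which this sketch does not try to model.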

Yeah, but really are you saving that much? 4 bytes per token? It's not like you are saving much in terms of seeks, since you are already there anyway. The only downside I see is a slightly larger index. Would be interesting to try it out and see.

The latter inefficiency we could solve by improving the payloads API: it could return an IndexInput instead of the byte array and the caller could consume it more efficiently.
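The difference between the two read paths can be sketched as follows. A java.nio.ByteBuffer stands in for the buffered input here; it is not an IndexInput, and PayloadReadSketch and both method names are hypothetical. The first method copies the bytes into a fresh array before decoding, which is roughly what a byte-array API forces; the second decodes straight from the input.

```java
import java.nio.ByteBuffer;

// ByteBuffer is only a stand-in for a buffered index input in this sketch.
final class PayloadReadSketch {

    // Byte-array style: copy the bytes out first, then decode from the copy.
    static float readViaCopy(ByteBuffer input) {
        byte[] copy = new byte[Float.BYTES];
        input.get(copy);                         // extra array copy
        return ByteBuffer.wrap(copy).getFloat(); // decode from the copy
    }

    // Stream style: decode directly from the input, no intermediate array.
    static float readDirect(ByteBuffer input) {
        return input.getFloat();
    }

    public static void main(String[] args) {
        ByteBuffer input = ByteBuffer.allocate(2 * Float.BYTES);
        input.putFloat(2.5f).putFloat(-1.0f);
        input.flip(); // switch from writing to reading
        System.out.println(readViaCopy(input));
        System.out.println(readDirect(input));
    }
}
```

Both paths return the same value; whether the saved copy matters in practice is the kind of thing that would need measuring.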

This is also interesting, but again requires some changes. With what I'm proposing, I think it could be done very simply w/o any API changes, and we just need to expose some of the IndexInput/Output helper classes a bit more to make it easier for people to encode/decode their stuff. Then, just documentation and some more Boosting*Query (Peter has already done BoostingNearQuery) and I think you have a pretty good flexible indexing AND searching capability all in a back compatible way using our existing code.

So I agree that we could use Attributes to make the payloads feature more usable, but I don't think it will be a replacement for flexible indexing.


Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


