Re: Attributes, DocConsumer, Flexible Indexing, etc.

Grant Ingersoll Thu, 06 Aug 2009 05:19:33 -0700


On Aug 6, 2009, at 5:48 AM, Michael McCandless wrote:

Agreed.

Yes, the ability to do things like implement Okapi, Language Modeling
or very sparse indexes (although we kind of have that already) would
not fit in with this stuff. Of course, those couldn't be solved
through the Attribute stuff anyway.


Grant's idea is something new and I think useful, ie offering some
sort of pluggability of what's stored in payloads, sitting entirely
outside (above) Lucene's core.

Maybe we should call it 'Flexible Payloads', or something, to
differentiate the two.

Or just Attribute as Payloads (but I'm horrible at naming).
Primarily, I want to be able to take advantage of all this great
Attribute work now ;-)

Mike
On Thu, Aug 6, 2009 at 5:10 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
I always thought flexible indexing is not only for storing your
app-specific data next to terms/docs.
Something more along the lines of efficient geo search, or ability to
try out various index encoding schemes without patching lucene.

In other words, this is something that can be a basis for
easy/pluggable implementation of payload-type functionality, not
vice-versa.
On Thu, Aug 6, 2009 at 01:55, Grant Ingersoll <gsing...@apache.org> wrote:
On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:
On 8/5/09 1:07 PM, Grant Ingersoll wrote:
Hmmm, OK.
Random, somewhat uneducated thought: Why not just define the codecs to
create byte arrays? Then we can use the existing payload capability much
like I do with the DelimitedPayloadTokenFilter. We'd probably have to make
sure this still worked with Similarity, but it seems like it could.
Thinking on this some more, seems like this could work already with an
AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter
(I know, horrible name). Then, on the Query side, the AttributeTermQuery is
just a glorified BoostingTermQuery with some callback hooks for dealing with
the Attribute (but maybe that isn't even needed), either that or we just
provide helper methods to the Similarity class so that people can easily
decode the byte array into an Attribute. In fact, maybe all that needs to
happen is the Attributes need to define encode/decode methods that
(de)serialize a byte array.
Seems like this approach would require very little in the way of changes
to Lucene, but I admit it isn't fully baked in my mind just yet. It also
has the nice benefit that all the work we did on Payloads isn't wasted.
This is resonating more and more with me.  What do you think?
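
[Illustration, not part of the original message.] The encode/decode idea
above can be sketched in a few lines of Java. Everything here is
hypothetical: PosWeight is an invented attribute value (a part-of-speech
tag plus a weight), not a Lucene class, and the 5-byte layout is just one
possible choice. It only shows what "Attributes define encode/decode
methods that (de)serialize a byte array" might look like.

```java
import java.nio.ByteBuffer;

/** Hypothetical attribute value (not a Lucene class): a POS tag plus a weight. */
final class PosWeight {
    final byte posTag;
    final float weight;

    PosWeight(byte posTag, float weight) {
        this.posTag = posTag;
        this.weight = weight;
    }

    /** Serialize to a fixed 5-byte array: 1 byte tag + 4-byte big-endian float. */
    byte[] encode() {
        return ByteBuffer.allocate(5).put(posTag).putFloat(weight).array();
    }

    /** Rebuild the value from bytes produced by encode(). */
    static PosWeight decode(byte[] bytes) {
        if (bytes.length != 5) {
            throw new IllegalArgumentException("expected 5 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new PosWeight(buf.get(), buf.getFloat());
    }
}
```

On the indexing side the byte array would be what gets stored as the
payload; on the search side, scoring code would call decode() on the
bytes it gets back.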
Well I think this would be a nice way of using the payloads better.
However, the idea behind flexible indexing is that you can customize the
on-disk encoding in a way that it is as efficient as it can be for your
particular use case. E.g. for payloads we currently have to encode the
length. An application might not have to do that if it knows exactly what is
stored.
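
[Illustration, not part of the original message.] A rough sketch of the
length-encoding point, under stated assumptions: the one-byte length
prefix below is a stand-in chosen for simplicity and is not Lucene's
actual on-disk format, and LengthPrefixDemo and its methods are invented
names. It compares a generic layout that stores a length before each
payload with an application-specific layout where the width is known in
advance.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical demo class; the layouts here are illustrative only. */
final class LengthPrefixDemo {
    /** Generic layout: each payload is preceded by a one-byte length. */
    static byte[] writeWithLengths(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : payloads) {
            if (p.length > 127) {
                throw new IllegalArgumentException("sketch supports payloads up to 127 bytes");
            }
            out.write(p.length);          // the per-payload overhead
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    /** Application-specific layout: every payload is `width` bytes, so no length is stored. */
    static byte[] writeFixedWidth(byte[][] payloads, int width) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : payloads) {
            if (p.length != width) {
                throw new IllegalArgumentException("payload is not " + width + " bytes");
            }
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    /** Reads payload number `index` from the fixed-width layout by computing its offset. */
    static byte[] readFixedWidth(byte[] data, int width, int index) {
        return Arrays.copyOfRange(data, index * width, (index + 1) * width);
    }
}
```

With this layout the saving is one byte per payload; how much that
matters in a real index is exactly the question raised in the reply
that follows.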
Then there's only the Payload API that returns you a byte array. It
basically copies the contents of the IndexInput (usually a
BufferedIndexInput, which means array copy from the byte buffer to the
payload byte array). If the application knows exactly what is stored it can
read/decode it more efficiently.
Yeah, but really are you saving that much? 4 bytes per token? It's not
like you are saving much in terms of seeks, since you are already there
anyway. The only downside I see is a slightly larger index. Would be
interesting to try it out and see.
The latter inefficiency we could solve by improving the payloads API: it
could return an IndexInput instead of the byte array and the caller could
consume it more efficiently.
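
[Illustration, not part of the original message.] The difference between
"return a byte array" and "return something the caller reads from" can be
sketched with java.nio.ByteBuffer standing in for Lucene's IndexInput.
This is an analogy under that assumption, not the Lucene API;
PayloadReadDemo and its methods are invented names, and the payload is
assumed to be a single 4-byte float.

```java
import java.nio.ByteBuffer;

/** Hypothetical demo class; ByteBuffer stands in for an index input. */
final class PayloadReadDemo {
    /** Byte-array style: copy the payload out of the buffer, then decode the copy. */
    static float readViaCopy(ByteBuffer buffer, int offset) {
        byte[] copy = new byte[4];
        ByteBuffer view = buffer.duplicate(); // leave the caller's position untouched
        view.position(offset);
        view.get(copy);                       // the extra array copy
        return ByteBuffer.wrap(copy).getFloat();
    }

    /** Stream style: decode directly from the underlying buffer, no intermediate array. */
    static float readInPlace(ByteBuffer buffer, int offset) {
        return buffer.getFloat(offset);
    }
}
```

Both methods return the same value; the second simply skips the
intermediate array, which is the saving being proposed.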
This is also interesting, but again requires some changes. With what I'm
proposing, I think it could be done very simply w/o any API changes, and we
just need to expose some of the IndexInput/Output helper classes a bit more
to make it easier for people to encode/decode their stuff. Then, just
documentation and some more Boosting*Query (Peter has already done
BoostingNearQuery) and I think you have a pretty good flexible indexing AND
searching capability all in a back compatible way using our existing code.
So I agree that we could use Attributes to make the payloads feature
more usable, but I don't think it will be a replacement for flexible
indexing.



