Re: Attributes, DocConsumer, Flexible Indexing, etc.

Michael McCandless Thu, 06 Aug 2009 02:49:00 -0700

Agreed.

Grant's idea is something new and, I think, useful, i.e. offering some
sort of pluggability of what's stored in payloads, sitting entirely
outside (above) Lucene's core.


Maybe we should call it 'Flexible Payloads', or something, to
differentiate the two.

Mike

On Thu, Aug 6, 2009 at 5:10 AM, Earwin Burrfoot<[email protected]> wrote:
> I always thought flexible indexing is not only for storing your
> app-specific data next to terms/docs.
> Something more along the lines of efficient geo search, or ability to
> try out various index encoding schemes without patching lucene.
>
> In other words, this is something that can be a basis for
> easy/pluggable implementation of payload-type functionality, not
> vice-versa.
>
> On Thu, Aug 6, 2009 at 01:55, Grant Ingersoll<[email protected]> wrote:
>>
>> On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:
>>
>>> On 8/5/09 1:07 PM, Grant Ingersoll wrote:
>>>>
>>>> Hmmm, OK.
>>>>
>>>> Random, somewhat uneducated thought:  Why not just define the codecs to
>>>> create byte arrays?  Then we can use the existing payload capability much
>>>> like I do with the DelimitedPayloadTokenFilter.  We'd probably have to
>>>> make sure this still worked with Similarity, but it seems like it could.
>>>> Thinking on this some more, it seems like this could already work with an
>>>> AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter
>>>> (I know, horrible name).  Then, on the Query side, the AttributeTermQuery
>>>> is just a glorified BoostingTermQuery with some callback hooks for dealing
>>>> with the Attribute (but maybe that isn't even needed); either that, or we
>>>> just provide helper methods to the Similarity class so that people can
>>>> easily decode the byte array into an Attribute.  In fact, maybe all that
>>>> needs to happen is the Attributes need to define encode/decode methods
>>>> that (de)serialize a byte array.
>>>>
>>>> Seems like this approach would require very little in the way of changes
>>>> to Lucene, but I admit it isn't fully baked in my mind just yet.  It also
>>>> has the nice benefit that all the work we did on Payloads isn't wasted.
>>>>
>>>> This is resonating more and more with me.  What do you think?
>>>>
>>>
>>> Well I think this would be a nice way of using the payloads better.
>>>
>>> However, the idea behind flexible indexing is that you can customize the
>>> on-disk encoding in a way that it is as efficient as it can be for your
>>> particular use case. E.g. for payloads we currently have to encode the
>>> length. An application might not have to do that if it knows exactly what is
>>> stored.
>>> Then there's only the Payload API that returns you a byte array. It
>>> basically copies the contents of the IndexInput (usually a
>>> BufferedIndexInput, which means array copy from the byte buffer to the
>>> payload byte array). If the application knows exactly what is stored it can
>>> read/decode it more efficiently.
>>
>> Yeah, but really are you saving that much?  4 bytes per token?  It's not
>> like you are saving much in terms of seeks, since you are already there
>> anyway.  The only downside I see is a slightly larger index.  Would be
>> interesting to try it out and see.
>>
>>
>>
>>
>>>
>>> The latter inefficiency we could solve by improving the payloads API: it
>>> could return an IndexInput instead of the byte array and the caller could
>>> consume it more efficiently.
>>
>> This is also interesting, but again requires some changes.  With what I'm
>> proposing, I think it could be done very simply w/o any API changes, and we
>> just need to expose some of the IndexInput/Output helper classes a bit more
>> to make it easier for people to encode/decode their stuff.  Then, just
>> documentation and some more Boosting*Query (Peter has already done
>> BoostingNearQuery) and I think you have a pretty good flexible indexing AND
>> searching capability all in a back compatible way using our existing code.
>>
>>>
>>> So I agree that we could use Attributes to make the payloads feature
>>> more usable, but I don't think it will be a replacement for flexible
>>> indexing.
>>
>>
>>
>>>
>>> Michael
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>>
>>
>>
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко ([email protected])
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
>
>
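
For illustration, the encode/decode idea quoted above (Attributes that
(de)serialize themselves to a byte array, which is then stored as a payload)
might look roughly like the sketch below. Everything in it is an assumption
made for the example: the class name WeightedTypeAttribute, its two fields and
the 8-byte layout are hypothetical and are not part of Lucene. It uses only
java.nio, so it shows the byte-level round trip, not the Lucene wiring.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: this class is NOT part of Lucene.
final class WeightedTypeAttribute {
    final float weight; // hypothetical per-token weight
    final int typeId;   // hypothetical per-token type code

    WeightedTypeAttribute(float weight, int typeId) {
        this.weight = weight;
        this.typeId = typeId;
    }

    // Fixed-width layout (an assumption of this sketch):
    // 4 bytes for the float, then 4 bytes for the int, big-endian.
    byte[] encode() {
        return ByteBuffer.allocate(8).putFloat(weight).putInt(typeId).array();
    }

    // Reads the same 8-byte layout back, starting at the given offset.
    static WeightedTypeAttribute decode(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8);
        float weight = buf.getFloat();
        int typeId = buf.getInt();
        return new WeightedTypeAttribute(weight, typeId);
    }

    public static void main(String[] args) {
        byte[] payloadBytes = new WeightedTypeAttribute(1.5f, 7).encode();
        WeightedTypeAttribute back = decode(payloadBytes, 0);
        System.out.println(back.weight + " " + back.typeId);
    }
}
```

Because the layout is fixed-width, the decoder needs no length field of its
own, which touches on the point made in the thread about an application that
knows exactly what is stored. Whether the payload mechanism still records a
length around these bytes is a separate question that the sketch does not
address.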

