Re: Iterating BinaryDocValues

Joel Bernstein Fri, 10 Jan 2014 12:43:25 -0800

I'm bulk extracting full, unsorted result sets from Solr. You give Solr a
query and it dumps the full result set in a single call. Result set streaming
is in place, but throughput is not as good as I would like.


Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:

> what are you doing with the data?
>
>
> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:
>
>> I'll provide a little more context. I'm working on bulk extracting
>> BinaryDocValues. My initial performance test was with in-memory
>> BinaryDocValues, but I think the end game is actually disk-based
>> BinaryDocValues.
>>
>> I was able to perform around 1 million docId->BytesRef lookups per second
>> with in-memory BinaryDocValues. Since I need to get the values for multiple
>> fields for each document, this bogs down pretty quickly.
>>
>> I'm wondering if there is a way to increase this throughput. Since
>> filling a BytesRef is pretty fast, I was assuming it was the seek that was
>> taking the time, but I didn't verify this. The first thing that came to
>> mind is iterating the docValues in such a way that the next docValue could
>> be loaded without a seek. But I haven't dug into how the BinaryDocValues
>> are formatted so I'm not sure if this would help or not. Also there could
>> be something else besides the seek that is limiting the throughput.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>>
>>> Yeah, I don't think it's from newer docvalues-using code like yours, Shai.
>>>
>>> Instead, the problems I had doing this are historical, because e.g.
>>> FieldCache pointed to large arrays and consumers were lazy about it,
>>> knowing that their reference pointed to bytes that would remain valid
>>> across invocations.
>>>
>>> We just have to remove these assumptions. I don't apologize for not
>>> doing this; as you show, it's some small % improvement (which we should go
>>> and get back!), but I went with safety first initially rather than bugs.
>>>
>>>
>>>
>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>>
>>>> I agree with Robert. We should leave cloning BytesRefs to whoever needs
>>>> that, and not penalize everyone else who doesn't need it. I must say I
>>>> didn't know I could "own" those BytesRefs, and I clone them whenever I
>>>> need to. I think I was bitten by one of the other APIs, so I assumed
>>>> returned BytesRefs are not "mine" across all the APIs.
>>>>
>>>> Shai
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>>
>>>>> The problem is actually simpler to solve.
>>>>>
>>>>> Look at the comment in the code; it tells you why it is this way:
>>>>>
>>>>>           // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
>>>>>           // assume "they" own the bytes after calling this!
>>>>>
>>>>> That is what we should fix. There is no need to make bulk APIs or even
>>>>> change the public API in any way (other than javadocs).
>>>>>
>>>>> We just move the cloning out of the codec and require the consumer
>>>>> to do it, same as TermsEnum or other APIs. The codec part is extremely
>>>>> simple here; it's even the way I had it initially.
>>>>>
>>>>> But at the time (and even still now) this comes with some risk of
>>>>> bugs. So initially I removed the reuse and went with a more conservative
>>>>> approach to start with.
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Adrien,
>>>>>>
>>>>>> Please find a bulkGet() scratch. It's ugly copy-paste that just reuses
>>>>>> the BytesRef, which provides a 10% gain.
>>>>>> ...
>>>>>> bulkGet took:101630 ms
>>>>>> ...
>>>>>> get took:114422 ms
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>>
>>>>>>> I don't think we should add such a method. Doc values are commonly
>>>>>>> read from collectors, so why do we need a method that works on top of
>>>>>>> a DocIdSetIterator? I'm also curious how specialized implementations
>>>>>>> could make this method faster than the default implementation?
>>>>>>>
>>>>>>> --
>>>>>>> Adrien
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>> Principal Engineer,
>>>>>> Grid Dynamics
>>>>>>
>>>>>> <http://www.griddynamics.com>
>>>>>>  <[email protected]>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
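The buffer-ownership point raised in the quoted messages (one reused buffer is cheaper, but only safe if consumers that keep a value copy it themselves) can be illustrated with a small, self-contained Java sketch. It deliberately does not use Lucene: Slice, ReusingSource, retainWithoutCopy and retainWithCopy are hypothetical names made up for this illustration, and the 16-byte buffer and three-byte sample values are assumptions chosen to keep the output easy to follow.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: none of these types are Lucene classes.
public class SharedBufferDemo {

    /** Hypothetical byte slice, loosely analogous to the idea of a BytesRef. */
    static final class Slice {
        byte[] bytes = new byte[0];
        int length;

        /** Deep copy for consumers that need the value to outlive the next get(). */
        Slice copy() {
            Slice s = new Slice();
            s.bytes = Arrays.copyOf(bytes, length);
            s.length = length;
            return s;
        }

        String utf8() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical value source that refills ONE internal buffer on every get(). */
    static final class ReusingSource {
        private final byte[][] values;
        private byte[] shared = new byte[16];

        ReusingSource(String... docs) {
            values = new byte[docs.length][];
            for (int i = 0; i < docs.length; i++) {
                values[i] = docs[i].getBytes(StandardCharsets.UTF_8);
            }
        }

        void get(int docId, Slice result) {
            byte[] v = values[docId];
            if (shared.length < v.length) {
                shared = new byte[v.length];
            }
            System.arraycopy(v, 0, shared, 0, v.length);
            result.bytes = shared;    // caller sees the shared buffer, not a private copy
            result.length = v.length;
        }
    }

    /** Consumer that wrongly assumes it owns the bytes it was handed. */
    static List<String> retainWithoutCopy(ReusingSource src, int docCount) {
        List<Slice> kept = new ArrayList<>();
        for (int doc = 0; doc < docCount; doc++) {
            Slice s = new Slice();
            src.get(doc, s);
            kept.add(s);              // still points into the source's buffer
        }
        return toStrings(kept);
    }

    /** Consumer that copies only when it needs to keep a value. */
    static List<String> retainWithCopy(ReusingSource src, int docCount) {
        List<Slice> kept = new ArrayList<>();
        Slice scratch = new Slice();  // reused across all lookups
        for (int doc = 0; doc < docCount; doc++) {
            src.get(doc, scratch);
            kept.add(scratch.copy());
        }
        return toStrings(kept);
    }

    private static List<String> toStrings(List<Slice> slices) {
        List<String> out = new ArrayList<>();
        for (Slice s : slices) {
            out.add(s.utf8());
        }
        return out;
    }

    public static void main(String[] args) {
        ReusingSource src = new ReusingSource("aaa", "bbb", "ccc");
        System.out.println(retainWithoutCopy(src, 3)); // prints [ccc, ccc, ccc]
        System.out.println(retainWithCopy(src, 3));    // prints [aaa, bbb, ccc]
    }
}
```

A consumer that only inspects each value during the loop needs neither the per-lookup allocation nor the copy, which is the case the reuse is meant to speed up. The sketch says nothing about how large that gain is in a real codec.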
