For the test I ran, I just timed the number of docId->BytesRef lookups I could do in a second.
Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[email protected]> wrote:

> Are you sure it's not the wire serialization etc. causing the bottleneck
> (e.g. converting to a UTF-8 string and back, network traffic, JSON
> encoding, etc.)?
>
>
> On Fri, Jan 10, 2014 at 3:39 PM, Joel Bernstein <[email protected]> wrote:
>
>> Bulk extracting full unsorted result sets from Solr. You give Solr a
>> query and it dumps the full result in a single call. The result set
>> streaming is in place, but throughput is not as good as I would like.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:
>>
>>> What are you doing with the data?
>>>
>>>
>>> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:
>>>
>>>> I'll provide a little more context. I'm working on bulk extracting
>>>> BinaryDocValues. My initial performance test was with in-memory
>>>> BinaryDocValues, but I think the end game is actually disk-based
>>>> BinaryDocValues.
>>>>
>>>> I was able to perform around 1 million docId->BytesRef lookups
>>>> per second with in-memory BinaryDocValues. Since I need to get the
>>>> values for multiple fields for each document, this bogs down pretty
>>>> quickly.
>>>>
>>>> I'm wondering if there is a way to increase this throughput. Since
>>>> filling a BytesRef is pretty fast, I was assuming it was the seek that
>>>> was taking the time, but I didn't verify this. The first thing that
>>>> came to mind is iterating the docValues in such a way that the next
>>>> docValue could be loaded without a seek. But I haven't dug into how
>>>> the BinaryDocValues are formatted, so I'm not sure whether this would
>>>> help. There could also be something other than the seek limiting the
>>>> throughput.
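The layout Joel is speculating about can be sketched outside Lucene. The toy model below (class names like `TinyBinaryDocValues` and `ByteSlice` are made up for illustration; this is not the real `BinaryDocValues` API) stores all values concatenated in one `byte[]` with an offsets array, so filling a caller-owned scratch slice does no allocation or copying, and visiting docIds in increasing order reads the backing array strictly sequentially, which is the access pattern where a disk-based codec would need no seek between neighboring documents.

```java
import java.nio.charset.StandardCharsets;

// Toy stand-in for Lucene's BytesRef: a slice into a shared byte[].
class ByteSlice {
    byte[] bytes;
    int offset, length;
}

// Hypothetical in-memory model of a binary doc-values column:
// all values concatenated into one byte[], with an offsets array
// so the value for doc i lives at [offsets[i], offsets[i+1]).
class TinyBinaryDocValues {
    private final byte[] data;
    private final int[] offsets; // length = numDocs + 1

    TinyBinaryDocValues(String[] values) {
        offsets = new int[values.length + 1];
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            offsets[i] = total;
            total += values[i].getBytes(StandardCharsets.UTF_8).length;
        }
        offsets[values.length] = total;
        data = new byte[total];
        for (int i = 0; i < values.length; i++) {
            byte[] b = values[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, data, offsets[i], b.length);
        }
    }

    // Fills a caller-owned scratch slice: no allocation, no copy.
    // For monotonically increasing docIds this walks `data` front to
    // back, so a disk-backed variant would never seek backwards.
    void get(int docId, ByteSlice scratch) {
        scratch.bytes = data;
        scratch.offset = offsets[docId];
        scratch.length = offsets[docId + 1] - offsets[docId];
    }

    static String asString(ByteSlice s) {
        return new String(s.bytes, s.offset, s.length, StandardCharsets.UTF_8);
    }
}
```

A bulk extractor in this model keeps one `ByteSlice` alive for the whole result set and calls `get()` per document, which is essentially the reuse that the rest of the thread debates.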
>>>>
>>>> Joel Bernstein
>>>> Search Engineer at Heliosearch
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>>>>
>>>>> Yeah, I don't think it's from newer docvalues-using code like yours,
>>>>> Shai.
>>>>>
>>>>> Instead, the problems I had doing this are historical: e.g. the
>>>>> FieldCache pointed to large arrays, and consumers were lazy about it,
>>>>> knowing that their reference pointed to bytes that would remain valid
>>>>> across invocations.
>>>>>
>>>>> We just have to remove these assumptions. I don't apologize for not
>>>>> doing this; as you show, it's some small % improvement (which we
>>>>> should go and get back!), but I went with safety first initially
>>>>> rather than bugs.
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>>>>
>>>>>> I agree with Robert. We should leave cloning BytesRefs to whoever
>>>>>> needs that, and not penalize everyone else who doesn't need it. I
>>>>>> must say I didn't know I can "own" those BytesRefs and clone them
>>>>>> whenever I need to. I think I was bitten by one of the other APIs,
>>>>>> so I assumed returned BytesRefs are not "mine" across all the APIs.
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>>>>
>>>>>>> The problem is really simpler to solve, actually.
>>>>>>>
>>>>>>> Look at the comments in the code; they tell you why it is this way:
>>>>>>>
>>>>>>>     // NOTE: we could have one buffer, but various consumers
>>>>>>>     // (e.g. FieldComparatorSource) assume "they" own the bytes
>>>>>>>     // after calling this!
>>>>>>>
>>>>>>> That is what we should fix. There is no need to make bulk APIs or
>>>>>>> even change the public API in any way (other than javadocs).
>>>>>>>
>>>>>>> We just move the cloning out of the codec and require the
>>>>>>> consumer to do it, same as TermsEnum or other APIs.
>>>>>>> The codec part is extremely simple here; it's even the way I had
>>>>>>> it initially.
>>>>>>>
>>>>>>> But at the time (and even still now) this comes with some risk of
>>>>>>> bugs. So initially I removed the reuse and went with a more
>>>>>>> conservative approach to start with.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Adrien,
>>>>>>>>
>>>>>>>> Please find the bulkGet() scratch. It's an ugly copy-paste that
>>>>>>>> just reuses the BytesRef, which provides a 10% gain:
>>>>>>>> ...
>>>>>>>> bulkGet took: 101630 ms
>>>>>>>> ...
>>>>>>>> get took: 114422 ms
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I don't think we should add such a method. Doc values are
>>>>>>>>> commonly read from collectors, so why do we need a method that
>>>>>>>>> works on top of a DocIdSetIterator? I'm also curious how
>>>>>>>>> specialized implementations could make this method faster than
>>>>>>>>> the default implementation.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adrien
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sincerely yours,
>>>>>>>> Mikhail Khludnev
>>>>>>>> Principal Engineer,
>>>>>>>> Grid Dynamics
>>>>>>>> <http://www.griddynamics.com>
>>>>>>>> <[email protected]>
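The ownership contract Robert describes, where the codec hands back one reused buffer and a consumer that wants to keep the bytes must deep-copy them itself, can be illustrated with another simplified sketch. The `deepCopyOf` helper below mirrors the idea of Lucene's `BytesRef.deepCopyOf`, but the classes (`SharedBytes`, `ReusingReader`) are invented for this example and are not the actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified sketch of a "caller does not own the bytes" contract.
class SharedBytes {
    byte[] bytes = new byte[0];
    int length;

    // What a retaining consumer must do: copy into private storage
    // (analogous in spirit to Lucene's BytesRef.deepCopyOf).
    static SharedBytes deepCopyOf(SharedBytes other) {
        SharedBytes copy = new SharedBytes();
        copy.bytes = Arrays.copyOfRange(other.bytes, 0, other.length);
        copy.length = other.length;
        return copy;
    }

    String utf8() { return new String(bytes, 0, length, StandardCharsets.UTF_8); }
}

class ReusingReader {
    private final byte[][] values;
    private final SharedBytes scratch = new SharedBytes(); // one buffer, reused

    ReusingReader(String... vals) {
        values = new byte[vals.length][];
        for (int i = 0; i < vals.length; i++) {
            values[i] = vals[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    // Returns the SAME object on every call; the contents are only
    // valid until the next get(). Consumers who need the bytes later
    // must call SharedBytes.deepCopyOf themselves.
    SharedBytes get(int docId) {
        byte[] v = values[docId];
        if (scratch.bytes.length < v.length) scratch.bytes = new byte[v.length];
        System.arraycopy(v, 0, scratch.bytes, 0, v.length);
        scratch.length = v.length;
        return scratch;
    }
}
```

The design trade-off is exactly the one debated above: moving the clone out of the reader gives every hot loop the cheap path for free, at the cost that a consumer which silently retains the returned reference now sees it aliased to whatever value was read last, a bug the old always-clone behavior made impossible.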
