The text value for the field. It's a generic bulk-extract tool, so it's whatever the user loads.
Joel Bernstein
Search Engineer at Heliosearch

On Fri, Jan 10, 2014 at 4:08 PM, Robert Muir <[email protected]> wrote:

> What are you storing in the binaryDV?

On Fri, Jan 10, 2014 at 3:44 PM, Joel Bernstein <[email protected]> wrote:

> For the test I ran, I just timed the number of docId->BytesRef lookups
> I could do in a second.

On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[email protected]> wrote:

> Are you sure it's not the wire serialization causing the bottleneck
> (e.g. converting to a UTF-8 string and back, network traffic, JSON
> encoding, etc.)?

On Fri, Jan 10, 2014 at 3:39 PM, Joel Bernstein <[email protected]> wrote:

> Bulk extracting full unsorted result sets from Solr. You give Solr a
> query and it dumps the full result in a single call. The result-set
> streaming is in place, but throughput is not as good as I would like.

On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:

> What are you doing with the data?

On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:

> I'll provide a little more context. I'm working on bulk extracting
> BinaryDocValues. My initial performance test was with in-memory
> BinaryDocValues, but I think the end game is actually disk-based
> BinaryDocValues.
>
> I was able to perform around 1 million docId->BytesRef lookups per
> second with in-memory BinaryDocValues. Since I need to get the values
> for multiple fields for each document, this bogs down pretty quickly.
>
> I'm wondering if there is a way to increase this throughput. Since
> filling a BytesRef is pretty fast, I was assuming it was the seek that
> was taking the time, but I didn't verify this. The first thing that
> came to mind is iterating the docValues in such a way that the next
> docValue could be loaded without a seek. But I haven't dug into how
> BinaryDocValues are formatted, so I'm not sure whether this would help.
> There could also be something other than the seek limiting the
> throughput.
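For reference, a minimal sketch of the extraction loop Joel describes, assuming the Lucene 4.x API (AtomicReader, and BinaryDocValues.get(int, BytesRef) filling a caller-supplied scratch). dumpSegment() and emit() are hypothetical names, with emit() standing in for whatever serializes the value onto the response stream:

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    class BulkDump {
      // For each document in the segment, look up the value of every
      // requested field; each dv.get() is one docId->BytesRef lookup,
      // the operation Joel timed at ~1M/sec. (Null checks for fields
      // with no doc values are omitted.)
      static void dumpSegment(AtomicReader reader, String[] fields) throws IOException {
        BinaryDocValues[] dvs = new BinaryDocValues[fields.length];
        for (int i = 0; i < fields.length; i++) {
          dvs[i] = reader.getBinaryDocValues(fields[i]);
        }
        BytesRef scratch = new BytesRef(); // filled, not owned, on each call
        for (int docID = 0; docID < reader.maxDoc(); docID++) {
          for (BinaryDocValues dv : dvs) {
            dv.get(docID, scratch);
            emit(docID, scratch);
          }
        }
      }

      static void emit(int docID, BytesRef value) {
        // placeholder: write the value onto the wire
      }
    }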
On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:

> Yeah, I don't think it's from newer docvalues-using code like yours,
> Shai.
>
> Instead, the problems I had doing this are historical: e.g. fieldcache
> pointed to large arrays, and consumers were lazy about it, knowing that
> their reference pointed to bytes that would remain valid across
> invocations.
>
> We just have to remove these assumptions. I don't apologize for not
> doing this: as you show, it's some small % improvement (which we should
> go and get back!), but I went with safety first initially rather than
> bugs.

On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:

> I agree with Robert. We should leave cloning BytesRefs to whoever needs
> that, and not penalize everyone else who doesn't need it. I must say I
> didn't know I could "own" those BytesRefs, and I clone them whenever I
> need to. I think I was bitten by one of the other APIs, so I assumed
> returned BytesRefs are not "mine" across all the APIs.
>
> Shai

On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:

> The problem is really simpler to solve, actually.
>
> Look at the comments in the code; they tell you why it is this way:
>
>     // NOTE: we could have one buffer, but various consumers
>     // (e.g. FieldComparatorSource)
>     // assume "they" own the bytes after calling this!
>
> That is what we should fix. There is no need to make bulk APIs or even
> change the public API in any way (other than javadocs).
>
> We just move the cloning out of the codec and require the consumer to
> do it, same as TermsEnum or other APIs. The codec part is extremely
> simple here; it's even the way I had it initially.
>
> But at the time (and even still now) this comes with some risk of bugs.
> So initially I removed the reuse and went with a more conservative
> approach to start with.

On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev
<[email protected]> wrote:

> Adrien,
>
> Please find the bulkGet() scratch. It's an ugly copy-paste; it just
> reuses the BytesRef, which provides a 10% gain:
>
>     bulkGet took: 101630 ms
>     get took: 114422 ms
>
> --
> Sincerely yours,
> Mikhail Khludnev
> Principal Engineer, Grid Dynamics
> <http://www.griddynamics.com>

On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:

> I don't think we should add such a method. Doc values are commonly read
> from collectors, so why do we need a method that works on top of a
> DocIdSetIterator? I'm also curious how specialized implementations
> could make this method faster than the default implementation.
>
> --
> Adrien
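To make the contract Robert proposes concrete: a minimal sketch of the consumer side, assuming the codec reuses its buffer so the returned bytes are only valid until the next get() call, the contract TermsEnum already has. CloningConsumer and copyValue() are hypothetical names; BytesRef.deepCopyOf() is the existing Lucene utility:

    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    class CloningConsumer {
      private final BytesRef scratch = new BytesRef();

      // Returns a copy the caller owns; the scratch bytes stay owned by
      // the codec and may be overwritten by the next get().
      BytesRef copyValue(BinaryDocValues dv, int docID) {
        dv.get(docID, scratch);              // fills scratch with codec-owned bytes
        return BytesRef.deepCopyOf(scratch); // explicit clone, as with TermsEnum
      }
    }

Consumers that only inspect the value (comparison, serialization) skip the copy entirely, which is where the savings come from.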

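Mikhail's patch isn't reproduced in the thread, but here is a guess at the shape of the bulkGet() under debate: walk a DocIdSetIterator and hand each value to a callback, reusing one scratch BytesRef rather than allocating per lookup. ValueConsumer is a hypothetical callback, not a real Lucene interface:

    import java.io.IOException;

    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    class BulkGet {
      interface ValueConsumer {
        void accept(int docID, BytesRef value);
      }

      // Reusing a single scratch BytesRef across lookups is the kind of
      // change Mikhail measured at ~10%; the value is only valid inside
      // the callback, for the reasons discussed above.
      static void bulkGet(BinaryDocValues dv, DocIdSetIterator disi,
                          ValueConsumer consumer) throws IOException {
        BytesRef scratch = new BytesRef();
        for (int doc = disi.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = disi.nextDoc()) {
          dv.get(doc, scratch);
          consumer.accept(doc, scratch);
        }
      }
    }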