what are you storing in the binaryDV?
On Fri, Jan 10, 2014 at 3:44 PM, Joel Bernstein <[email protected]> wrote:

> For the test I ran, I just timed the number of docId->BytesRef lookups I
> could do in a second.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[email protected]> wrote:
>
>> Are you sure it's not the wire serialization causing the bottleneck
>> (e.g. converting to a UTF-8 string and back, network traffic, JSON
>> encoding, etc.)?
>>
>>
>> On Fri, Jan 10, 2014 at 3:39 PM, Joel Bernstein <[email protected]> wrote:
>>
>>> Bulk extracting full unsorted result sets from Solr. You give Solr a
>>> query and it dumps the full result in a single call. The result set
>>> streaming is in place, but throughput is not as good as I would like.
>>>
>>> Joel Bernstein
>>> Search Engineer at Heliosearch
>>>
>>>
>>> On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:
>>>
>>>> what are you doing with the data?
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:
>>>>
>>>>> I'll provide a little more context. I'm working on bulk extracting
>>>>> BinaryDocValues. My initial performance test was with in-memory
>>>>> BinaryDocValues, but I think the end game is actually disk-based
>>>>> BinaryDocValues.
>>>>>
>>>>> I was able to perform around 1 million docId->BytesRef lookups per
>>>>> second with in-memory BinaryDocValues. Since I need to get the values
>>>>> for multiple fields for each document, this bogs down pretty quickly.
>>>>>
>>>>> I'm wondering if there is a way to increase this throughput. Since
>>>>> filling a BytesRef is pretty fast, I was assuming it was the seek that
>>>>> was taking the time, but I didn't verify this. The first thing that
>>>>> came to mind is iterating the docValues in such a way that the next
>>>>> docValue could be loaded without a seek. But I haven't dug into how the
>>>>> BinaryDocValues are formatted, so I'm not sure whether this would help.
>>>>> Something else besides the seek could also be limiting the throughput.
>>>>>
>>>>> Joel Bernstein
>>>>> Search Engineer at Heliosearch
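For reference, the kind of docId->BytesRef timing loop described above might
look roughly like the sketch below, assuming the Lucene 4.x API where
BinaryDocValues.get(int docID, BytesRef result) fills a caller-supplied
BytesRef; the class and field names are illustrative, not from the actual
test:

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    public class DocValuesLookupBench {

      /** Times sequential docId->BytesRef lookups over every doc in one segment. */
      public static void time(AtomicReader reader, String field) throws IOException {
        BinaryDocValues values = reader.getBinaryDocValues(field); // null if the field has no binary DVs
        BytesRef scratch = new BytesRef();
        int maxDoc = reader.maxDoc();
        long bytes = 0;
        long start = System.nanoTime();
        for (int docId = 0; docId < maxDoc; docId++) {
          values.get(docId, scratch); // one docId->BytesRef lookup
          bytes += scratch.length;    // consume the value so the loop isn't optimized away
        }
        long millis = (System.nanoTime() - start) / 1000000;
        System.out.println(maxDoc + " lookups (" + bytes + " bytes) in " + millis + " ms");
      }
    }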
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>>>>>
>>>>>> Yeah, I don't think it's from newer docvalues-using code like yours,
>>>>>> Shai.
>>>>>>
>>>>>> Instead, the problems I had doing this are historical: e.g. the
>>>>>> fieldcache pointed to large arrays, and consumers were lazy about it,
>>>>>> knowing that their reference pointed to bytes that would remain valid
>>>>>> across invocations.
>>>>>>
>>>>>> We just have to remove these assumptions. I don't apologize for not
>>>>>> doing this; as you show, it's some small % improvement (which we
>>>>>> should go and get back!), but I went with safety first initially
>>>>>> rather than bugs.
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>>>>>
>>>>>>> I agree with Robert. We should leave cloning BytesRefs to whoever
>>>>>>> needs that, and not penalize everyone else who doesn't need it. I
>>>>>>> must say I didn't know I could "own" those BytesRefs, and I clone
>>>>>>> them whenever I need to. I think I was bitten by one of the other
>>>>>>> APIs, so I assumed returned BytesRefs are not "mine" across all the
>>>>>>> APIs.
>>>>>>>
>>>>>>> Shai
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>>>>>
>>>>>>>> The problem is really simpler to solve, actually.
>>>>>>>>
>>>>>>>> Look at the comments in the code; they tell you why it is this way:
>>>>>>>>
>>>>>>>>   // NOTE: we could have one buffer, but various consumers
>>>>>>>>   // (e.g. FieldComparatorSource) assume "they" own the bytes
>>>>>>>>   // after calling this!
>>>>>>>>
>>>>>>>> That is what we should fix. There is no need to make bulk APIs or
>>>>>>>> even change the public API in any way (other than javadocs).
>>>>>>>>
>>>>>>>> We just move the cloning out of the codec and require the consumer
>>>>>>>> to do it, the same as TermsEnum or other APIs. The codec part is
>>>>>>>> extremely simple here; it's even the way I had it initially.
>>>>>>>>
>>>>>>>> But at the time (and even still now) this comes with some risk of
>>>>>>>> bugs. So initially I removed the reuse and went with a more
>>>>>>>> conservative approach to start with.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Adrien,
>>>>>>>>>
>>>>>>>>> Please find the bulkGet() scratch. It's an ugly copy-paste that
>>>>>>>>> just reuses the BytesRef, which provides a 10% gain.
>>>>>>>>> ...
>>>>>>>>> bulkGet took:101630 ms
>>>>>>>>> ...
>>>>>>>>> get took:114422 ms
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think we should add such a method. Doc values are
>>>>>>>>>> commonly read from collectors, so why do we need a method that
>>>>>>>>>> works on top of a DocIdSetIterator? I'm also curious how
>>>>>>>>>> specialized implementations could make this method faster than
>>>>>>>>>> the default implementation.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Adrien
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sincerely yours
>>>>>>>>> Mikhail Khludnev
>>>>>>>>> Principal Engineer,
>>>>>>>>> Grid Dynamics
>>>>>>>>>
>>>>>>>>> <http://www.griddynamics.com>
>>>>>>>>> <[email protected]>
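For illustration, under the contract Robert describes above, the codec would
fill a shared scratch buffer and a consumer that needs to hold onto a value
would copy it explicitly, the same ownership rule TermsEnum uses. A minimal
sketch, again assuming the 4.x get(int, BytesRef) signature; BytesRef.deepCopyOf
is the existing utility method, the rest is illustrative:

    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    public class OwnershipSketch {

      /** Reads one value and returns a private copy; callers that only peek can skip the copy. */
      static BytesRef readAndRetain(BinaryDocValues values, int docId) {
        BytesRef scratch = new BytesRef();
        values.get(docId, scratch);          // under the proposed contract, scratch may point
                                             // into a buffer the codec reuses on the next call
        return BytesRef.deepCopyOf(scratch); // the clone moves to the consumer, out of the codec
      }
    }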

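The bulkGet() scratch itself isn't inlined in the thread, but a rough,
hypothetical reconstruction of the idea, reusing one BytesRef across an
ordered batch of lookups instead of allocating per call, could look like
this (the visitor interface and method names are invented for illustration):

    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    public class BulkGetSketch {

      /** Callback invoked once per doc; the BytesRef is only valid during the call. */
      interface ValueVisitor {
        void visit(int docId, BytesRef value);
      }

      /** Visits values for ascending docIds, reusing one scratch BytesRef throughout. */
      static void bulkGet(BinaryDocValues values, int[] ascendingDocIds, ValueVisitor visitor) {
        BytesRef scratch = new BytesRef(); // single allocation for the whole batch
        for (int docId : ascendingDocIds) {
          values.get(docId, scratch);      // ascending order keeps disk-based reads moving forward
          visitor.visit(docId, scratch);   // the visitor clones only if it must keep the bytes
        }
      }
    }

With in-memory doc values the gain is mostly avoided allocation; for the
disk-based case, visiting docIds in ascending order is what would let the
codec read forward instead of seeking, which is the kind of iteration Joel
asks about above.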