For the test I ran, I just timed the number of docId->BytesRef lookups I could do in a second.
Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[email protected]> wrote:

> Are you sure it's not the wire serialization etc. causing the bottleneck
> (e.g. converting to a UTF-8 string and back, network traffic, JSON
> encoding, etc.)?
>
>
> On Fri, Jan 10, 2014 at 3:39 PM, Joel Bernstein <[email protected]> wrote:
>
>> Bulk extracting full unsorted result sets from Solr. You give Solr a
>> query and it dumps the full result in a single call. The result set
>> streaming is in place, but throughput is not as good as I would like.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:
>>
>>> What are you doing with the data?
>>>
>>>
>>> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:
>>>
>>>> I'll provide a little more context. I'm working on bulk extracting
>>>> BinaryDocValues. My initial performance test was with in-memory
>>>> BinaryDocValues, but I think the end game is actually disk-based
>>>> BinaryDocValues.
>>>>
>>>> I was able to perform around 1 million docId->BytesRef lookups
>>>> per second with in-memory BinaryDocValues. Since I need to get the
>>>> values for multiple fields for each document, this bogs down pretty
>>>> quickly.
>>>>
>>>> I'm wondering if there is a way to increase this throughput. Since
>>>> filling a BytesRef is pretty fast, I was assuming it was the seek that
>>>> was taking the time, but I didn't verify this. The first thing that
>>>> came to mind is iterating the docValues in such a way that the next
>>>> docValue could be loaded without a seek. But I haven't dug into how
>>>> the BinaryDocValues are formatted, so I'm not sure whether this would
>>>> help. There could also be something other than the seek limiting the
>>>> throughput.
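The layout Joel is speculating about can be sketched outside Lucene. The toy model below (class names like `TinyBinaryDocValues` and `ByteSlice` are made up for illustration; this is not the real `BinaryDocValues` API) stores all values concatenated in one `byte[]` with an offsets array, so filling a caller-owned scratch slice does no allocation or copying, and visiting docIds in increasing order reads the backing array strictly sequentially, which is the access pattern where a disk-based codec would need no seek between neighboring documents.

```java
import java.nio.charset.StandardCharsets;

// Toy stand-in for Lucene's BytesRef: a slice into a shared byte[].
class ByteSlice {
    byte[] bytes;
    int offset, length;
}

// Hypothetical in-memory model of a binary doc-values column:
// all values concatenated into one byte[], with an offsets array
// so the value for doc i lives at [offsets[i], offsets[i+1]).
class TinyBinaryDocValues {
    private final byte[] data;
    private final int[] offsets; // length = numDocs + 1

    TinyBinaryDocValues(String[] values) {
        offsets = new int[values.length + 1];
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            offsets[i] = total;
            total += values[i].getBytes(StandardCharsets.UTF_8).length;
        }
        offsets[values.length] = total;
        data = new byte[total];
        for (int i = 0; i < values.length; i++) {
            byte[] b = values[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, data, offsets[i], b.length);
        }
    }

    // Fills a caller-owned scratch slice: no allocation, no copy.
    // For monotonically increasing docIds this walks `data` front to
    // back, so a disk-backed variant would never seek backwards.
    void get(int docId, ByteSlice scratch) {
        scratch.bytes = data;
        scratch.offset = offsets[docId];
        scratch.length = offsets[docId + 1] - offsets[docId];
    }

    static String asString(ByteSlice s) {
        return new String(s.bytes, s.offset, s.length, StandardCharsets.UTF_8);
    }
}
```

A bulk extractor in this model keeps one `ByteSlice` alive for the whole result set and calls `get()` per document, which is essentially the reuse that the rest of the thread debates.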
>>>>
>>>> Joel Bernstein
>>>> Search Engineer at Heliosearch
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>>>>
>>>>> Yeah, I don't think it's from newer docvalues-using code like yours,
>>>>> Shai.
>>>>>
>>>>> Instead, the problems I had doing this are historical: e.g. the
>>>>> FieldCache pointed to large arrays, and consumers were lazy about it,
>>>>> knowing that their reference pointed to bytes that would remain valid
>>>>> across invocations.
>>>>>
>>>>> We just have to remove these assumptions. I don't apologize for not
>>>>> doing this; as you show, it's some small % improvement (which we
>>>>> should go and get back!), but I went with safety first initially
>>>>> rather than bugs.
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>>>>
>>>>>> I agree with Robert. We should leave cloning BytesRefs to whoever
>>>>>> needs that, and not penalize everyone else who doesn't need it. I
>>>>>> must say I didn't know I can "own" those BytesRefs and clone them
>>>>>> whenever I need to. I think I was bitten by one of the other APIs,
>>>>>> so I assumed returned BytesRefs are not "mine" across all the APIs.
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>>>>
>>>>>>> The problem is really simpler to solve, actually.
>>>>>>>
>>>>>>> Look at the comments in the code; they tell you why it is this way:
>>>>>>>
>>>>>>>     // NOTE: we could have one buffer, but various consumers
>>>>>>>     // (e.g. FieldComparatorSource) assume "they" own the bytes
>>>>>>>     // after calling this!
>>>>>>>
>>>>>>> That is what we should fix. There is no need to make bulk APIs or
>>>>>>> even change the public API in any way (other than javadocs).
>>>>>>>
>>>>>>> We just move the cloning out of the codec and require the
>>>>>>> consumer to do it, same as TermsEnum or other APIs.
>>>>>>> The codec part is extremely simple here; it's even the way I had
>>>>>>> it initially.
>>>>>>>
>>>>>>> But at the time (and even still now) this comes with some risk of
>>>>>>> bugs. So initially I removed the reuse and went with a more
>>>>>>> conservative approach to start with.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Adrien,
>>>>>>>>
>>>>>>>> Please find the bulkGet() scratch. It's an ugly copy-paste that
>>>>>>>> just reuses the BytesRef, which provides a 10% gain:
>>>>>>>> ...
>>>>>>>> bulkGet took: 101630 ms
>>>>>>>> ...
>>>>>>>> get took: 114422 ms
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I don't think we should add such a method. Doc values are
>>>>>>>>> commonly read from collectors, so why do we need a method that
>>>>>>>>> works on top of a DocIdSetIterator? I'm also curious how
>>>>>>>>> specialized implementations could make this method faster than
>>>>>>>>> the default implementation.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Adrien
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sincerely yours,
>>>>>>>> Mikhail Khludnev
>>>>>>>> Principal Engineer,
>>>>>>>> Grid Dynamics
>>>>>>>> <http://www.griddynamics.com>
>>>>>>>> <[email protected]>
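The ownership contract Robert describes, where the codec hands back one reused buffer and a consumer that wants to keep the bytes must deep-copy them itself, can be illustrated with another simplified sketch. The `deepCopyOf` helper below mirrors the idea of Lucene's `BytesRef.deepCopyOf`, but the classes (`SharedBytes`, `ReusingReader`) are invented for this example and are not the actual API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified sketch of a "caller does not own the bytes" contract.
class SharedBytes {
    byte[] bytes = new byte[0];
    int length;

    // What a retaining consumer must do: copy into private storage
    // (analogous in spirit to Lucene's BytesRef.deepCopyOf).
    static SharedBytes deepCopyOf(SharedBytes other) {
        SharedBytes copy = new SharedBytes();
        copy.bytes = Arrays.copyOfRange(other.bytes, 0, other.length);
        copy.length = other.length;
        return copy;
    }

    String utf8() { return new String(bytes, 0, length, StandardCharsets.UTF_8); }
}

class ReusingReader {
    private final byte[][] values;
    private final SharedBytes scratch = new SharedBytes(); // one buffer, reused

    ReusingReader(String... vals) {
        values = new byte[vals.length][];
        for (int i = 0; i < vals.length; i++) {
            values[i] = vals[i].getBytes(StandardCharsets.UTF_8);
        }
    }

    // Returns the SAME object on every call; the contents are only
    // valid until the next get(). Consumers who need the bytes later
    // must call SharedBytes.deepCopyOf themselves.
    SharedBytes get(int docId) {
        byte[] v = values[docId];
        if (scratch.bytes.length < v.length) scratch.bytes = new byte[v.length];
        System.arraycopy(v, 0, scratch.bytes, 0, v.length);
        scratch.length = v.length;
        return scratch;
    }
}
```

The design trade-off is exactly the one debated above: moving the clone out of the reader gives every hot loop the cheap path for free, at the cost that a consumer which silently retains the returned reference now sees it aliased to whatever value was read last, a bug the old always-clone behavior made impossible.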
