Bulk extracting full unsorted result sets from Solr: you give Solr a query
and it dumps the full result in a single call. The result-set streaming is
in place, but the throughput is not as good as I would like.
Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:

> what are you doing with the data?
>
>
> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:
>
>> I'll provide a little more context. I'm working on bulk extracting
>> BinaryDocValues. My initial performance test was with in-memory
>> BinaryDocValues, but I think the end game is actually disk-based
>> BinaryDocValues.
>>
>> I was able to perform around 1 million docId->BytesRef lookups per second
>> with in-memory BinaryDocValues. Since I need to get the values of multiple
>> fields for each document, this bogs down pretty quickly.
>>
>> I'm wondering if there is a way to increase this throughput. Since
>> filling a BytesRef is pretty fast, I was assuming it was the seek that was
>> taking the time, but I didn't verify this. The first thing that came to
>> mind is iterating the docValues in such a way that the next docValue could
>> be loaded without a seek. But I haven't dug into how the BinaryDocValues
>> are formatted, so I'm not sure whether this would help. There could also
>> be something else besides the seek that is limiting the throughput.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>>
>>> Yeah, I don't think it's from newer docvalues-using code like yours, Shai.
>>>
>>> Instead, the problems I had doing this are historical: e.g. the
>>> fieldcache pointed to large arrays, and consumers were lazy about it,
>>> knowing that their reference pointed to bytes that would remain valid
>>> across invocations.
>>>
>>> We just have to remove these assumptions. I don't apologize for not
>>> doing this; as you show, it's some small % improvement (which we should
>>> go and get back!), but I went with safety first initially rather than bugs.
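Joel's "load the next docValue without a seek" idea depends on how the values are laid out on disk. A rough, self-contained sketch (not Lucene code; `PackedBinaryValues`, its offset table, and `demo()` are invented for illustration): if a segment's values are concatenated into one array with a start-offset table, visiting docIDs in ascending order becomes a pure forward read with no per-document seek, and one scratch buffer can serve every lookup:

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch (not Lucene code): per-segment binary values stored as
 * one concatenated byte[] plus a start-offset table. Visiting docIDs in
 * ascending order is then a forward read with no per-document seek.
 */
class PackedBinaryValues {
    private final byte[] data;   // all values, concatenated in docID order
    private final int[] starts;  // starts[doc] .. starts[doc + 1] bounds doc's value

    PackedBinaryValues(byte[][] values) {
        starts = new int[values.length + 1];
        int total = 0;
        for (int doc = 0; doc < values.length; doc++) {
            starts[doc] = total;
            total += values[doc].length;
        }
        starts[values.length] = total;
        data = new byte[total];
        for (int doc = 0; doc < values.length; doc++) {
            System.arraycopy(values[doc], 0, data, starts[doc], values[doc].length);
        }
    }

    /**
     * Copy doc's value into the caller-supplied scratch buffer and return its
     * length. The same scratch is reused for every lookup, so nothing is
     * allocated per document.
     */
    int get(int docID, byte[] scratch) {
        int len = starts[docID + 1] - starts[docID];
        System.arraycopy(data, starts[docID], scratch, 0, len);
        return len;
    }

    /** Decode every value in ascending docID order, reusing one scratch buffer. */
    static String[] demo() {
        byte[][] raw = {
            "a".getBytes(StandardCharsets.UTF_8),
            "bb".getBytes(StandardCharsets.UTF_8),
            "ccc".getBytes(StandardCharsets.UTF_8),
        };
        PackedBinaryValues values = new PackedBinaryValues(raw);
        byte[] scratch = new byte[16];
        String[] out = new String[raw.length];
        for (int doc = 0; doc < raw.length; doc++) {
            int len = values.get(doc, scratch);
            out[doc] = new String(scratch, 0, len, StandardCharsets.UTF_8);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", demo()));
    }
}
```

Whether the real BinaryDocValues format allows this depends on its encoding, which the thread leaves open; the sketch only shows why ascending-order access would remove the per-lookup seek.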
>>>
>>>
>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>>
>>>> I agree with Robert. We should leave cloning BytesRefs to whoever needs
>>>> that, and not penalize everyone else who doesn't need it. I must say I
>>>> didn't know I could "own" those BytesRefs, and I clone them whenever I
>>>> need to. I think I was bitten by one of the other APIs, so I assumed
>>>> returned BytesRefs are not "mine" across all the APIs.
>>>>
>>>> Shai
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>>
>>>>> The problem is really simpler to solve, actually.
>>>>>
>>>>> Look at the comments in the code; they tell you why it is this way:
>>>>>
>>>>>   // NOTE: we could have one buffer, but various consumers
>>>>>   // (e.g. FieldComparatorSource) assume "they" own the bytes
>>>>>   // after calling this!
>>>>>
>>>>> That is what we should fix. There is no need to make bulk APIs or even
>>>>> change the public API in any way (other than javadocs).
>>>>>
>>>>> We just move the cloning out of the codec and require the consumer to
>>>>> do it, the same as TermsEnum or other APIs. The codec part is extremely
>>>>> simple here; it's even the way I had it initially.
>>>>>
>>>>> But at the time (and even still now) this comes with some risk of
>>>>> bugs, so initially I removed the reuse and went with a more
>>>>> conservative approach to start with.
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <[email protected]> wrote:
>>>>>
>>>>>> Adrien,
>>>>>>
>>>>>> Please find the bulkGet() scratch. It's an ugly copy-paste that just
>>>>>> reuses the BytesRef, which provides a 10% gain:
>>>>>> ...
>>>>>> bulkGet took: 101630 ms
>>>>>> ...
>>>>>> get took: 114422 ms
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>>
>>>>>>> I don't think we should add such a method.
>>>>>>> Doc values are commonly read from collectors, so why do we need a
>>>>>>> method that works on top of a DocIdSetIterator? I'm also curious how
>>>>>>> specialized implementations could make this method faster than the
>>>>>>> default implementation.
>>>>>>>
>>>>>>> --
>>>>>>> Adrien
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>> Principal Engineer,
>>>>>> Grid Dynamics
>>>>>>
>>>>>> <http://www.griddynamics.com>
>>>>>> <[email protected]>
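The reuse contract Robert describes (the codec fills a shared buffer, and a consumer that wants to keep the bytes clones them itself) can be sketched without Lucene. `Ref`, `Source`, `get`, and `getCopy` below are invented stand-ins, not the real BytesRef/BinaryDocValues API; the point is the trade-off between a per-call copy, which is safe but is what Mikhail's ~10% gain came from avoiding, and a shared scratch, where bytes held across calls get silently overwritten:

```java
/** Minimal stand-in for Lucene's BytesRef: a byte[] plus a valid length. */
class Ref {
    byte[] bytes = new byte[0];
    int length;
}

/** Hypothetical value source contrasting the two API contracts in the thread. */
class Source {
    private final byte[][] values;
    private byte[] scratch = new byte[16]; // shared buffer, overwritten on every get()

    Source(byte[][] values) {
        this.values = values;
    }

    /** Conservative contract: allocate a fresh copy per call; safe to hold forever. */
    Ref getCopy(int doc) {
        Ref r = new Ref();
        r.bytes = values[doc].clone();
        r.length = r.bytes.length;
        return r;
    }

    /**
     * Reuse contract: fill the shared scratch and point the caller's ref at it.
     * No per-call allocation, but the bytes are only valid until the next call;
     * a consumer that keeps them must clone, as with TermsEnum.
     */
    void get(int doc, Ref reuse) {
        byte[] v = values[doc];
        if (scratch.length < v.length) {
            scratch = new byte[v.length];
        }
        System.arraycopy(v, 0, scratch, 0, v.length);
        reuse.bytes = scratch;
        reuse.length = v.length;
    }

    public static void main(String[] args) {
        Source s = new Source(new byte[][] { {1, 2}, {3, 4, 5} });
        Ref r = new Ref();
        s.get(0, r);
        System.out.println("doc 0 first byte: " + r.bytes[0]);
        s.get(1, r); // overwrites the shared scratch: doc 0's bytes are gone
        System.out.println("after doc 1, same ref sees: " + r.bytes[0]);
    }
}
```

This is the hazard behind the codec comment Robert quotes: consumers like FieldComparatorSource assumed they owned the returned bytes, which is exactly the assumption that has to be removed before the cloning can move out of the codec.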
