The text value for the field. It's a generic bulk-extract tool, so it's whatever the user loads.
Joel Bernstein
Search Engineer at Heliosearch

On Fri, Jan 10, 2014 at 4:08 PM, Robert Muir <[email protected]> wrote:

> What are you storing in the binaryDV?

On Fri, Jan 10, 2014 at 3:44 PM, Joel Bernstein <[email protected]> wrote:

> For the test I ran, I just timed the number of docId->BytesRef lookups
> I could do in a second.

On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[email protected]> wrote:

> Are you sure it's not the wire serialization causing the bottleneck
> (e.g. converting to a UTF-8 string and back, network traffic, JSON
> encoding, etc.)?

On Fri, Jan 10, 2014 at 3:39 PM, Joel Bernstein <[email protected]> wrote:

> Bulk extracting full unsorted result sets from Solr. You give Solr a
> query and it dumps the full result in a single call. The result-set
> streaming is in place, but throughput is not as good as I would like.

On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:

> What are you doing with the data?

On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:

> I'll provide a little more context. I'm working on bulk extracting
> BinaryDocValues. My initial performance test was with in-memory
> BinaryDocValues, but I think the end game is actually disk-based
> BinaryDocValues.
>
> I was able to perform around 1 million docId->BytesRef lookups per
> second with in-memory BinaryDocValues. Since I need to get the values
> for multiple fields for each document, this bogs down pretty quickly.
>
> I'm wondering if there is a way to increase this throughput. Since
> filling a BytesRef is pretty fast, I was assuming it was the seek that
> was taking the time, but I didn't verify this. The first thing that
> came to mind is iterating the docValues in such a way that the next
> docValue could be loaded without a seek. But I haven't dug into how
> BinaryDocValues are formatted, so I'm not sure whether this would help.
> There could also be something other than the seek limiting the
> throughput.
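For reference, a minimal sketch of the extraction loop Joel describes, assuming the Lucene 4.x API (AtomicReader, and BinaryDocValues.get(int, BytesRef) filling a caller-supplied scratch). dumpSegment() and emit() are hypothetical names, with emit() standing in for whatever serializes the value onto the response stream:

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    class BulkDump {
      // For each document in the segment, look up the value of every
      // requested field; each dv.get() is one docId->BytesRef lookup,
      // the operation Joel timed at ~1M/sec. (Null checks for fields
      // with no doc values are omitted.)
      static void dumpSegment(AtomicReader reader, String[] fields) throws IOException {
        BinaryDocValues[] dvs = new BinaryDocValues[fields.length];
        for (int i = 0; i < fields.length; i++) {
          dvs[i] = reader.getBinaryDocValues(fields[i]);
        }
        BytesRef scratch = new BytesRef(); // filled, not owned, on each call
        for (int docID = 0; docID < reader.maxDoc(); docID++) {
          for (BinaryDocValues dv : dvs) {
            dv.get(docID, scratch);
            emit(docID, scratch);
          }
        }
      }

      static void emit(int docID, BytesRef value) {
        // placeholder: write the value onto the wire
      }
    }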
On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:

> Yeah, I don't think it's from newer docvalues-using code like yours,
> Shai.
>
> Instead, the problems I had doing this are historical: e.g. fieldcache
> pointed to large arrays, and consumers were lazy about it, knowing that
> their reference pointed to bytes that would remain valid across
> invocations.
>
> We just have to remove these assumptions. I don't apologize for not
> doing this: as you show, it's some small % improvement (which we should
> go and get back!), but I went with safety first initially rather than
> bugs.

On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:

> I agree with Robert. We should leave cloning BytesRefs to whoever needs
> that, and not penalize everyone else who doesn't need it. I must say I
> didn't know I could "own" those BytesRefs, and I clone them whenever I
> need to. I think I was bitten by one of the other APIs, so I assumed
> returned BytesRefs are not "mine" across all the APIs.
>
> Shai

On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:

> The problem is really simpler to solve, actually.
>
> Look at the comments in the code; they tell you why it is this way:
>
>     // NOTE: we could have one buffer, but various consumers
>     // (e.g. FieldComparatorSource)
>     // assume "they" own the bytes after calling this!
>
> That is what we should fix. There is no need to make bulk APIs or even
> change the public API in any way (other than javadocs).
>
> We just move the cloning out of the codec and require the consumer to
> do it, same as TermsEnum or other APIs. The codec part is extremely
> simple here; it's even the way I had it initially.
>
> But at the time (and even still now) this comes with some risk of bugs.
> So initially I removed the reuse and went with a more conservative
> approach to start with.

On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev
<[email protected]> wrote:

> Adrien,
>
> Please find the bulkGet() scratch. It's an ugly copy-paste; it just
> reuses the BytesRef, which provides a 10% gain:
>
>     bulkGet took: 101630 ms
>     get took: 114422 ms
>
> --
> Sincerely yours,
> Mikhail Khludnev
> Principal Engineer, Grid Dynamics
> <http://www.griddynamics.com>

On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:

> I don't think we should add such a method. Doc values are commonly read
> from collectors, so why do we need a method that works on top of a
> DocIdSetIterator? I'm also curious how specialized implementations
> could make this method faster than the default implementation.
>
> --
> Adrien
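To make the contract Robert proposes concrete: a minimal sketch of the consumer side, assuming the codec reuses its buffer so the returned bytes are only valid until the next get() call, the contract TermsEnum already has. CloningConsumer and copyValue() are hypothetical names; BytesRef.deepCopyOf() is the existing Lucene utility:

    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.util.BytesRef;

    class CloningConsumer {
      private final BytesRef scratch = new BytesRef();

      // Returns a copy the caller owns; the scratch bytes stay owned by
      // the codec and may be overwritten by the next get().
      BytesRef copyValue(BinaryDocValues dv, int docID) {
        dv.get(docID, scratch);              // fills scratch with codec-owned bytes
        return BytesRef.deepCopyOf(scratch); // explicit clone, as with TermsEnum
      }
    }

Consumers that only inspect the value (comparison, serialization) skip the copy entirely, which is where the savings come from.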

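Mikhail's patch isn't reproduced in the thread, but here is a guess at the shape of the bulkGet() under debate: walk a DocIdSetIterator and hand each value to a callback, reusing one scratch BytesRef rather than allocating per lookup. ValueConsumer is a hypothetical callback, not a real Lucene interface:

    import java.io.IOException;

    import org.apache.lucene.index.BinaryDocValues;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    class BulkGet {
      interface ValueConsumer {
        void accept(int docID, BytesRef value);
      }

      // Reusing a single scratch BytesRef across lookups is the kind of
      // change Mikhail measured at ~10%; the value is only valid inside
      // the callback, for the reasons discussed above.
      static void bulkGet(BinaryDocValues dv, DocIdSetIterator disi,
                          ValueConsumer consumer) throws IOException {
        BytesRef scratch = new BytesRef();
        for (int doc = disi.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = disi.nextDoc()) {
          dv.get(doc, scratch);
          consumer.accept(doc, scratch);
        }
      }
    }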