Sorry for the delay; clearing through my inbox backlog... We should
double-check the code, but one thing that has bitten me in the past with
variable-width data is that the binary array builder's ReserveData call
[1] does not behave the same way as Reserve: the former grows the buffer
by only the exact size requested, while the latter grows buffers by
powers of two.
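To make that concrete, here is a rough sketch of the usage pattern (the
wrapper function and names are mine, just for illustration): compute the
total value bytes up front and reserve once, rather than letting
ReserveData grow the buffer in exact-size steps:

    #include <memory>
    #include <string>
    #include <vector>

    #include "arrow/api.h"

    // Sketch only: Reserve() sizes the offsets/validity buffers and
    // grows capacity geometrically, while ReserveData() [1] grows the
    // value-bytes buffer by exactly what is requested -- so compute
    // the total and reserve it once instead of incrementally.
    arrow::Status BuildBinaryArray(const std::vector<std::string>& values,
                                   std::shared_ptr<arrow::Array>* out) {
      arrow::BinaryBuilder builder;
      ARROW_RETURN_NOT_OK(
          builder.Reserve(static_cast<int64_t>(values.size())));

      int64_t total_bytes = 0;
      for (const auto& v : values) total_bytes += v.size();
      // One exact-size reservation up front; per-value ReserveData()
      // calls would instead risk a reallocation on nearly every append.
      ARROW_RETURN_NOT_OK(builder.ReserveData(total_bytes));

      for (const auto& v : values) {
        builder.UnsafeAppend(v.data(), static_cast<int32_t>(v.size()));
      }
      return builder.Finish(out);
    }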
Also to answer one question:

> I'm not sure what the true contract of the memory pool and the
> associated allocator is. Does this interface guarantee that the memory
> allocated will not be zeroed out or touched in some similar fashion
> that would trigger physical memory mapping?

I don't think we've defined anything specific by way of this contract. I
don't know what the underlying allocators do, but I do not think the
MemoryPool code path will actually touch any bytes by itself. So one
possibility, at least as a prototype, is that you could provide an
intercepting memory pool that handles larger allocations.

[1] https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
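To sketch what I mean by an intercepting pool (prototype only -- the
class name, the threshold, the Linux-style mmap path, and the lack of
statistics tracking are all invented here), combined with your mmap
idea: requests above a threshold get an anonymous mapping, so the
virtual range is reserved but physical pages are only faulted in when
the decoder actually writes to them.

    #include <sys/mman.h>

    #include <algorithm>
    #include <cstring>
    #include <string>

    #include "arrow/memory_pool.h"
    #include "arrow/status.h"

    // Prototype: delegate small requests to the wrapped pool, satisfy
    // large ones with an anonymous mmap (Linux/POSIX; some platforms
    // spell it MAP_ANON). Dispatching on size in Free()/Reallocate()
    // mirrors the dispatch in Allocate(), so no bookkeeping is needed.
    class LargeAllocPool : public arrow::MemoryPool {
     public:
      explicit LargeAllocPool(arrow::MemoryPool* wrapped)
          : wrapped_(wrapped) {}

      arrow::Status Allocate(int64_t size, uint8_t** out) override {
        if (size < kThreshold) return wrapped_->Allocate(size, out);
        void* p = mmap(nullptr, static_cast<size_t>(size),
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
          return arrow::Status::OutOfMemory("mmap failed");
        }
        *out = static_cast<uint8_t*>(p);
        return arrow::Status::OK();
      }

      arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                               uint8_t** ptr) override {
        if (old_size < kThreshold && new_size < kThreshold) {
          return wrapped_->Reallocate(old_size, new_size, ptr);
        }
        // Naive path for the sketch: allocate new, copy, free old.
        uint8_t* fresh = nullptr;
        ARROW_RETURN_NOT_OK(Allocate(new_size, &fresh));
        std::memcpy(fresh, *ptr,
                    static_cast<size_t>(std::min(old_size, new_size)));
        Free(*ptr, old_size);
        *ptr = fresh;
        return arrow::Status::OK();
      }

      void Free(uint8_t* buffer, int64_t size) override {
        if (size < kThreshold) return wrapped_->Free(buffer, size);
        munmap(buffer, static_cast<size_t>(size));
      }

      int64_t bytes_allocated() const override {
        // mmap'd bytes are not tracked in this sketch.
        return wrapped_->bytes_allocated();
      }

      std::string backend_name() const override {
        return "large-alloc-mmap";
      }

     private:
      static constexpr int64_t kThreshold = 64 << 20;  // arbitrary
      arrow::MemoryPool* wrapped_;
    };

Since an anonymous mapping is only backed by physical pages on first
write, the untouched tail of an over-sized reservation costs only
virtual address space, which I think addresses the concern below.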
On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai <rdet...@gmail.com> wrote:

> Hi Antoine and all!
>
> Sorry for the delay, I wanted to understand things a bit better before
> getting back to you. As discussed, I focused on the Parquet case. I've
> looked into parquet/encoding.cc to see what could be done to have a
> better memory reservation with ByteArrays.
>
> On my journey, I noticed a few things:
>
> - The Parquet dictionary is first deserialized into an array of views,
> then converted to an Arrow-style variable/fixed-size binary array
> (here
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652>
> / here
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670>).
> This latter conversion is done even if it is not used afterwards; in
> fact, it seems to be used only here
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845>
> (never for fixed-size binary, and not for variable-size binary either
> if the target Arrow array is not dict-encoded). So if you confirm my
> "discoveries", we could spare ourselves some dictionary conversions by
> doing them lazily when *InsertDictionary* is called.
>
> - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030>
> and DictByteArrayDecoderImpl::DecodeArrow
> <https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974>
> have been short-circuited inside
> ByteArrayDictionaryRecordReader::ReadValuesDense
> <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532>
> and ByteArrayDictionaryRecordReader::ReadValuesSpaced
> <https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548>
> (dictionary-encoded data pages are always RLE in Parquet). That would
> mean they are dead code now, wouldn't it? Well, the DecodeArrow
> methods are kind of public, but it is very confusing that they are not
> used by Arrow itself ;-)
>
> Finally, regarding your idea (@antoine) of using a simple heuristic to
> pre-allocate the array, I'm not totally comfortable (maybe wrongly)
> with doing the over-allocation at that level, because I'm not sure
> what the true contract of the memory pool and the associated allocator
> is. Does this interface guarantee that the memory allocated will not
> be zeroed out or touched in some similar fashion that would trigger
> physical memory mapping?
>
> With my suggestion of using mmap to do huge allocations, I am
> positioning myself at a level where I know for sure that the
> underlying physical memory is not mapped, because we don't touch the
> memory. But as you noticed, I have less knowledge about the context,
> so it's very hard to decide when to trigger the "runway"
> pre-allocation or not.
>
> Hope this all makes sense. Took me a while to understand how the
> decoding works ;-)
>
> Remi
>
> On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou <anto...@python.org>
> wrote:
>
> > On 05/06/2020 at 17:09, Rémi Dettai wrote:
> > > I looked into the details of why the decoder could not estimate
> > > the target Arrow array size for my Parquet column. It's because I
> > > am decoding from Parquet-Dictionary to Arrow-Plain (which is the
> > > default when loading Parquet). In this case the size prediction is
> > > impossible :-(
> >
> > But we can probably make up a heuristic, for example:
> >
> >     avg(dictionary value size) * number of non-null values
> >
> > It would avoid a number of resizes, even though there may still be a
> > couple of them at the end. It may oversize in some cases, but much
> > less than your proposed strategy of reserving a huge chunk of
> > virtual memory :-)
> >
> > Regards
> >
> > Antoine.
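For what it's worth, the heuristic Antoine describes above would boil
down to something like the following (a hypothetical helper, not an
existing API; the parameter names are mine), using sizes the dictionary
decoder already knows:

    #include <cstdint>

    // Hypothetical helper (names are mine) for the quoted heuristic:
    //   avg(dictionary value size) * number of non-null values
    // dict_value_bytes: total byte size of all dictionary values;
    // dict_size: number of dictionary entries.
    int64_t EstimateDecodedValueBytes(int64_t dict_value_bytes,
                                      int64_t dict_size,
                                      int64_t num_non_null) {
      if (dict_size == 0) return 0;
      const double avg =
          static_cast<double>(dict_value_bytes) / dict_size;
      return static_cast<int64_t>(avg * num_non_null);
    }

Feeding that estimate into a single ReserveData() call up front would
avoid most of the exact-size resizes, at the cost of oversizing when the
referenced values skew shorter than the dictionary average.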