Sorry for the delay.  Clearing through my inbox backlog ...

We should double-check the code, but one thing that has bitten me in the
past with variable-width data is that the binary array builder's
ReserveData call [1] does not behave the same way Reserve does.  The
former only grows the buffer by the exact size requested; the latter
grows buffers by a power of two.
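
To illustrate (just a sketch against the builder API at that commit;
BuildExample and the element count/value width are made up):

  #include <memory>
  #include "arrow/builder.h"

  arrow::Status BuildExample() {
    arrow::BinaryBuilder builder;
    // Reserve() sizes the offsets/validity buffers for 1000 elements,
    // growing capacity with the power-of-two behavior described above.
    ARROW_RETURN_NOT_OK(builder.Reserve(1000));
    // ReserveData() grows the value buffer by exactly the requested
    // number of bytes, so an underestimate here still means extra
    // reallocations during Append().
    ARROW_RETURN_NOT_OK(builder.ReserveData(1000 * 16));
    for (int i = 0; i < 1000; ++i) {
      ARROW_RETURN_NOT_OK(builder.Append("0123456789abcdef"));
    }
    std::shared_ptr<arrow::Array> out;
    return builder.Finish(&out);
  }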

Also, to answer one question:


> I'm not sure what the true
> contract of the memory pool and the associated allocator is. Does this
> interface guarantee that the memory allocated will not be zeroed out or
> touched in some similar fashion that would trigger physical memory mapping

I don't think we've defined anything specific by way of this contract.
I don't know what the underlying allocators do, but I do not think the
MemoryPool code path will actually touch any bytes by itself.  So one
possibility, at least as a prototype, is that you could provide an
intercepting memory pool that handles larger allocations differently.
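
For instance, something along these lines (purely a hypothetical
sketch, not existing code; MmapInterceptPool is a made-up name, it
assumes the current MemoryPool virtual interface and POSIX mmap, and it
does not track the mmap'd bytes in the accounting):

  #include <sys/mman.h>

  #include <algorithm>
  #include <cstring>
  #include <string>

  #include "arrow/memory_pool.h"
  #include "arrow/status.h"

  // Serve big requests with anonymous mmap so virtual address space is
  // reserved but physical pages are only mapped once bytes are
  // actually touched; delegate everything else to the wrapped pool.
  class MmapInterceptPool : public arrow::MemoryPool {
   public:
    explicit MmapInterceptPool(arrow::MemoryPool* wrapped)
        : wrapped_(wrapped) {}

    arrow::Status Allocate(int64_t size, uint8_t** out) override {
      if (size < kThreshold) return wrapped_->Allocate(size, out);
      void* addr = mmap(nullptr, static_cast<size_t>(size),
                        PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (addr == MAP_FAILED) {
        return arrow::Status::OutOfMemory("mmap failed");
      }
      *out = static_cast<uint8_t*>(addr);
      return arrow::Status::OK();
    }

    arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                             uint8_t** ptr) override {
      if (old_size < kThreshold && new_size < kThreshold) {
        return wrapped_->Reallocate(old_size, new_size, ptr);
      }
      // Crude allocate/copy/free for the sketch; mremap would avoid
      // the copy on Linux.
      uint8_t* out;
      ARROW_RETURN_NOT_OK(Allocate(new_size, &out));
      std::memcpy(out, *ptr,
                  static_cast<size_t>(std::min(old_size, new_size)));
      Free(*ptr, old_size);
      *ptr = out;
      return arrow::Status::OK();
    }

    void Free(uint8_t* buffer, int64_t size) override {
      if (size < kThreshold) return wrapped_->Free(buffer, size);
      munmap(buffer, static_cast<size_t>(size));
    }

    // Accounting only sees the delegated allocations in this sketch.
    int64_t bytes_allocated() const override {
      return wrapped_->bytes_allocated();
    }
    std::string backend_name() const override { return "mmap-intercept"; }

   private:
    static constexpr int64_t kThreshold = 1 << 20;  // 1 MiB, arbitrary
    arrow::MemoryPool* wrapped_;
  };

That would at least let you experiment without touching the builder or
decoder code paths.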

[1]
https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/arrow/array/builder_binary.h#L253
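
Also, regarding the heuristic Antoine suggests below (average
dictionary value size * number of non-null values), a rough sketch of
what that could look like (EstimateBinaryDataSize is a made-up helper
name; it assumes the decoded dictionary is available as an
arrow::BinaryArray):

  #include "arrow/array.h"

  // Hypothetical helper: estimate the output value-buffer size when
  // decoding Parquet dictionary-encoded data to a plain Arrow binary
  // array.  Skewed dictionaries can make this over- or under-shoot,
  // so a few resizes at the end are still possible.
  int64_t EstimateBinaryDataSize(const arrow::BinaryArray& dictionary,
                                 int64_t num_non_null_values) {
    if (dictionary.length() == 0) return 0;
    const int64_t total_bytes =
        dictionary.value_offset(dictionary.length()) -
        dictionary.value_offset(0);
    const int64_t avg = total_bytes / dictionary.length();
    return avg * num_non_null_values;
  }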

On Tue, Jun 16, 2020 at 8:07 AM Rémi Dettai <rdet...@gmail.com> wrote:

> Hi Antoine and all!
>
> Sorry for the delay, I wanted to understand things a bit better before
> getting back to you.  As discussed, I focused on the Parquet case. I've
> looked into parquet/encoding.cc to see what could be done to get better
> memory reservation for ByteArrays.
>
> On my journey, I noticed a few things:
> - The Parquet dictionary is first deserialized into an array of views,
> then converted to an Arrow-style variable/fixed-size binary array (here:
> https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1628-L1652
> / here:
> https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1657-L1670).
> This latter conversion is done even when the result is not used
> afterwards; in fact it seems to be used only here:
> https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1840-L1845
> (never used for fixed-size binary, and not used for variable-size
> binary either if the target Arrow array is not dict-encoded). So if you
> confirm my "discoveries", we could spare ourselves some dictionary
> conversions by doing them lazily when *InsertDictionary* is called.
> - It seems that DictByteArrayDecoderImpl::DecodeArrowNonNull
> (https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L2030)
> and DictByteArrayDecoderImpl::DecodeArrow
> (https://github.com/apache/arrow/blob/04a1867eeb58f0c515e7ee5a6300a8f61045a6cd/cpp/src/parquet/encoding.cc#L1974)
> have been short-circuited inside
> ByteArrayDictionaryRecordReader::ReadValuesDense
> (https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1532)
> and ByteArrayDictionaryRecordReader::ReadValuesSpaced
> (https://github.com/apache/arrow/blob/6c721c579f7d279aa006bfff9b701f8a2a6fe50d/cpp/src/parquet/column_reader.cc#L1548),
> since dictionary-encoded data pages are always RLE in Parquet. This
> would mean they are dead code now, wouldn't it? Well, the DecodeArrow
> methods are kind of public, but it is very confusing that they are not
> used by Arrow itself ;-)
>
> Finally, regarding your idea (@antoine) of using a simple heuristic to
> pre-allocate the array: I'm not totally comfortable (maybe wrongly)
> with doing the over-allocation at that level, because I'm not sure what
> the true contract of the memory pool and the associated allocator is.
> Does this interface guarantee that the memory allocated will not be
> zeroed out or touched in some similar fashion that would trigger
> physical memory mapping? With my suggestion of using mmap for huge
> allocations, I am positioning myself at a level where I know for sure
> that the underlying physical memory is not mapped, because we don't
> touch the memory. But as you noticed, I have less knowledge about the
> context at that level, so it's very hard to decide when to trigger the
> "runway" pre-allocation.
>
> Hope this all makes sense. Took me a while to understand how the decoding
> works ;-)
>
> Remi
>
>
>
> On Fri, Jun 5, 2020 at 17:20, Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > On 05/06/2020 at 17:09, Rémi Dettai wrote:
> > > I looked into the details of why the decoder could not estimate
> > > the target Arrow array size for my Parquet column. It's because I
> > > am decoding from Parquet-Dictionary to Arrow-Plain (which is the
> > > default when loading Parquet). In this case the size prediction is
> > > impossible :-(
> >
> > But we can probably make up a heuristic.  For example
> >    avg(dictionary value size) * number of non-null values
> >
> > It would avoid a number of resizes, even though there may still be a
> > couple of them at the end.  It may oversize in some cases, but much
> > less than your proposed strategy of reserving a huge chunk of virtual
> > memory :-)
> >
> > Regards
> >
> > Antoine.
> >
>
