[
https://issues.apache.org/jira/browse/ARROW-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-4688:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21216
> [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning
> row_group_size
> ---------------------------------------------------------------------------------
>
> Key: ARROW-4688
> URL: https://issues.apache.org/jira/browse/ARROW-4688
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Remek Zajac
> Assignee: Wes McKinney
> Priority: Blocker
> Labels: parquet, pull-request-available
> Fix For: 0.13.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> We are working on Parquet files that involve nested lists. At most they are
> multi-dimensional lists of simple types (never structs), but I understand
> that, for Parquet, they are still nested columns and involve repetition levels.
> Some of these columns hold lists of rather large byte arrays (that dominate
> the overall size of the row). When we bump the `row_group_size` to above 16MB
> we see:
>
> {code:java}
> File "pyarrow/_parquet.pyx", line 700, in
> pyarrow._parquet.ParquetReader.read_row_group
> File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented
> for chunked array outputs{code}
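>
> For reference, a minimal reproduction sketch of the kind of data we mean (column
> name, sizes and path are illustrative; with an affected Arrow version, any nested
> byte-array column whose row-group chunk exceeds ~16MB should trigger it):
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # A single column of lists of large binary values: ~1 MB per row,
> # so a 64-row row group holds well over 16 MB in one column chunk.
> values = [[b"x" * (1 << 20)] for _ in range(64)]
> table = pa.table({"payload": pa.array(values, type=pa.list_(pa.binary()))})
>
> pq.write_table(table, "/tmp/nested.parquet", row_group_size=64)
>
> # Reading the row group back converts the BYTE_ARRAY column chunk into
> # multiple 16 MB builder chunks, which the nested (list) path rejects:
> pq.ParquetFile("/tmp/nested.parquet").read_row_group(0)
> # -> ArrowNotImplementedError: Nested data conversions not implemented
> #    for chunked array outputs
> {code}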
>
> I conclude it is
> [this|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848]
> bit of code that is complaining:
>
> {code:java}
> template <typename ParquetType>
> Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
>   if (descr_->max_repetition_level() == 0) {
>     // Flat, no action
>     return Status::OK();
>   }
>
>   std::shared_ptr<Array> flat_array;
>
>   // ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is
>   // not yet implemented
>   if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
>     if (inout_array->chunked_array()->num_chunks() > 1) {
>       return Status::NotImplemented(
>           "Nested data conversions not implemented for "
>           "chunked array outputs");{code}
>
> This happens in the call stack of
> ColumnReader::ColumnReaderImpl::NextBatch
> and appears to be provoked by
> [this|https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604]
> constant:
> {code:java}
> template <>
> void TypedRecordReader<ByteArrayType>::InitializeBuilder() {
>   // Maximum of 16MB chunks
>   constexpr int32_t kBinaryChunksize = 1 << 24;
>   DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);
>   builder_.reset(
>       new ::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));
> }
> {code}
> This appears to imply that, if the column chunk data is larger than
> kBinaryChunksize (hardcoded to 1 << 24 bytes, i.e. 16MB), it is returned as a
> Datum::CHUNKED_ARRAY with more than one (16MB) chunk, which ultimately leads
> to the Status::NotImplemented error.
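> If it helps, the same 16MB chunking can presumably be observed on a *flat*
> BYTE_ARRAY column, where chunked output is supported; a rough sketch (file name
> and sizes are illustrative, and the expected chunk count is an assumption based
> on the code above):
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Flat (non-nested) binary column, ~32 MB in a single row group.
> table = pa.table({"blob": pa.array([b"x" * (1 << 20)] * 32, type=pa.binary())})
> pq.write_table(table, "/tmp/flat.parquet", row_group_size=32)
>
> col = pq.ParquetFile("/tmp/flat.parquet").read_row_group(0).column("blob")
> # Expectation: more than one chunk, because the reader builds the column in
> # ~16 MB (kBinaryChunksize) pieces. The nested path refuses exactly this case.
> print(col.num_chunks)
> {code}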
> We have no influence over what data we ingest, we have some influence over how
> we flatten it, and we need to tune our row_group_size to something sensibly
> larger than 16MB.
> We see no obvious workaround for this, so we need to ask: (1) does the above
> diagnosis appear to be correct, (2) do people see any sensible workarounds,
> and (3) is there an imminent intention to fix this in the Arrow community and,
> if not, how difficult would it be to fix (in case we can afford to help)?
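> Regarding (2), the only diagnostic we could come up with is inspecting the
> per-column-chunk sizes in the Parquet footer metadata to see which chunks cross
> the 16MB boundary; a rough sketch (path and threshold are illustrative, and this
> only locates the problem, it does not avoid it):
>
> {code:python}
> import pyarrow.parquet as pq
>
> md = pq.ParquetFile("/tmp/nested.parquet").metadata
> LIMIT = 1 << 24  # kBinaryChunksize, 16 MiB
>
> # Print every column chunk whose uncompressed size exceeds the 16 MiB limit.
> for rg in range(md.num_row_groups):
>     row_group = md.row_group(rg)
>     for col in range(row_group.num_columns):
>         chunk = row_group.column(col)
>         if chunk.total_uncompressed_size > LIMIT:
>             print(rg, chunk.path_in_schema, chunk.total_uncompressed_size)
> {code}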
--
This message was sent by Atlassian Jira
(v8.20.10#820010)