[ https://issues.apache.org/jira/browse/ARROW-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-4688:
--------------------------------
    Priority: Blocker  (was: Major)

> [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-4688
>                 URL: https://issues.apache.org/jira/browse/ARROW-4688
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remek Zajac
>            Priority: Blocker
>              Labels: parquet
>             Fix For: 0.13.0
>
>
> We are working on Parquet files that involve nested lists. At most they are
> multi-dimensional lists of simple types (never structs), but I understand that,
> for Parquet, they are still nested columns and involve repetition levels.
> Some of these columns hold lists of rather large byte arrays (that dominate
> the overall size of the row). When we bump the row_group_size to above 16MB
> we see:
>
> {code:java}
> File "pyarrow/_parquet.pyx", line 700, in pyarrow._parquet.ParquetReader.read_row_group
> File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs{code}
>
> I conclude it is
> [this|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848]
> bit complaining:
>
> {code:java}
> template <typename ParquetType>
> Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
>   if (descr_->max_repetition_level() == 0) {
>     // Flat, no action
>     return Status::OK();
>   }
>
>   std::shared_ptr<Array> flat_array;
>
>   // ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is
>   // not yet implemented
>   if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
>     if (inout_array->chunked_array()->num_chunks() > 1) {
>       return Status::NotImplemented(
>           "Nested data conversions not implemented for "
>           "chunked array outputs");{code}
>
> This appears to happen in the call stack of
> ColumnReader::ColumnReaderImpl::NextBatch and it appears to be provoked by
> [this|https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604]
> constant:
>
> {code:java}
> template <>
> void TypedRecordReader<ByteArrayType>::InitializeBuilder() {
>   // Maximum of 16MB chunks
>   constexpr int32_t kBinaryChunksize = 1 << 24;
>   DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);
>   builder_.reset(
>       new ::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));
> }
> {code}
>
> This appears to imply that column chunk data larger than kBinaryChunksize
> (hard-coded to 16MB) is returned as a Datum::CHUNKED_ARRAY of more than one
> 16MB chunk, which ultimately leads to the Status::NotImplemented error.
> We have no influence over what data we ingest, we have some influence over how
> we flatten it, and we need to tune our row_group_size to something sensibly
> larger than 16MB.
> We see no obvious workaround for this, so we need to ask: (1) does the above
> diagnosis appear to be correct, (2) do people see any sensible workarounds,
> (3) is there an imminent intention to fix this in the Arrow community and, if
> not, how difficult would it be to fix (in case we can afford to help)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
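
For context, the scenario described in the report can be sketched in a few lines of pyarrow; the column name, file name, row count, and value sizes below are illustrative assumptions rather than details from the issue. The sketch writes a single list<binary> column whose one row group holds well over 16MB of BYTE_ARRAY data, so the reader's ChunkedBinaryBuilder emits more than one chunk and the nested wrap step hits the NotImplemented branch quoted above.

{code:python}
# Illustrative sketch only; names and sizes are assumptions, not from the report.
import pyarrow as pa
import pyarrow.parquet as pq

# One list<binary> column: 32 rows, each holding a single 1 MiB value (~32 MiB total).
values = [[b"x" * (1 << 20)] for _ in range(32)]
table = pa.Table.from_arrays(
    [pa.array(values, type=pa.list_(pa.binary()))], names=["payload"])

# Keep all rows in one row group (row_group_size is a row count), so the
# single column chunk carries far more than the 16 MiB kBinaryChunksize.
pq.write_table(table, "nested.parquet", row_group_size=table.num_rows)

# Reading that row group back is expected to raise ArrowNotImplementedError:
# "Nested data conversions not implemented for chunked array outputs".
pq.ParquetFile("nested.parquet").read_row_group(0)
{code}

Per the WrapIntoListArray excerpt, the early return for max_repetition_level() == 0 means a flat BYTE_ARRAY column of the same size reads back fine even when it comes out chunked; only nested (repeated) columns hit the multi-chunk rejection. Keeping each row group's binary payload under kBinaryChunksize should likewise avoid the error, but that is precisely the row_group_size constraint the report asks to lift.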