[ https://issues.apache.org/jira/browse/ARROW-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-4688:
--------------------------------
    Priority: Blocker  (was: Major)

> [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning 
> row_group_size
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-4688
>                 URL: https://issues.apache.org/jira/browse/ARROW-4688
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remek Zajac
>            Priority: Blocker
>              Labels: parquet
>             Fix For: 0.13.0
>
>
> We are working on parquet files that involve nested lists. At most they are 
> multi-dimensional lists of simple types (never structs), but I understand 
> that, for Parquet, they are still nested columns and involve repetition 
> levels. Some of these columns hold lists of rather large byte arrays (which 
> dominate the overall size of the row). When we bump the `row_group_size` 
> above 16MB we see: 
>  
> {code}
> File "pyarrow/_parquet.pyx", line 700, in pyarrow._parquet.ParquetReader.read_row_group
> File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
> {code}
>  
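> For reference, a minimal sketch of how we hit this (assuming pyarrow; the 
> file name and sizes are illustrative, not our actual data):
>  
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> # Each row holds a one-element list with a ~1MB binary value, so a row
> # group of 32 rows puts ~32MB into a single nested column chunk.
> rows = [[b"x" * (1 << 20)] for _ in range(32)]
> table = pa.table({"payload": pa.array(rows, type=pa.list_(pa.binary()))})
> 
> pq.write_table(table, "nested.parquet", row_group_size=32)
> # Raises ArrowNotImplementedError: Nested data conversions not implemented...
> pq.ParquetFile("nested.parquet").read_row_group(0)
> {code}
>  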
> I conclude it is 
> [this|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848]
> bit of code that is complaining:
>  
> {code:java}
> template <typename ParquetType>
> Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
>   if (descr_->max_repetition_level() == 0) {
>     // Flat, no action
>     return Status::OK();
>   }
> 
>   std::shared_ptr<Array> flat_array;
> 
>   // ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is
>   // not yet implemented
>   if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
>     if (inout_array->chunked_array()->num_chunks() > 1) {
>       return Status::NotImplemented(
>           "Nested data conversions not implemented for "
>           "chunked array outputs");
> {code}
>  
> This appears to happen in the call stack of 
> ColumnReader::ColumnReaderImpl::NextBatch 
> and to be provoked by 
> [this|https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604]
> constant:
> {code:java}
> template <>
> void TypedRecordReader<ByteArrayType>::InitializeBuilder() {
>   // Maximum of 16MB chunks
>   constexpr int32_t kBinaryChunksize = 1 << 24;
>   DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);
>   builder_.reset(
>       new ::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));
> }
> {code}
> This appears to imply that column chunk data larger than kBinaryChunksize 
> (hardcoded to 16MB) is returned as a Datum::CHUNKED_ARRAY of more than one 
> 16MB chunk, which ultimately leads to the Status::NotImplemented error.
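> As a sanity check of this reading (a sketch only; the file names are 
> hypothetical), a flat binary column larger than 16MB reads back fine, just 
> as a multi-chunk ChunkedArray; it is only the nested (repeated) path that 
> refuses chunked output:
>  
> {code:python}
> import pyarrow.parquet as pq
> 
> # Flat BYTE_ARRAY column: the reader simply returns several <=16MB chunks.
> flat = pq.read_table("flat_binary.parquet")
> print(flat.column("payload").num_chunks)  # > 1 once the data exceeds 16MB
> 
> # The same values nested in a list column go through WrapIntoListArray,
> # which rejects multi-chunk output with the NotImplemented error above.
> nested = pq.read_table("nested_binary.parquet")  # raises
> {code}
>  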
> We have no influence over what data we ingest, we have some influence over 
> how we flatten it, and we need to tune our row_group_size to something 
> sensibly larger than 16MB. 
> We see no obvious workaround for this, so we need to ask: (1) does the above 
> diagnosis appear to be correct; (2) do people see any sensible workarounds; 
> (3) is there an imminent intention to fix this in the Arrow community, and 
> if not, how difficult would it be to fix (in case we can afford to help)?
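> For the record, the only stop-gap we can think of is to cap row_group_size 
> so that the largest nested column chunk stays under kBinaryChunksize; a 
> rough sketch (safe_row_group_size and the sizing heuristic are ours, not an 
> Arrow API):
>  
> {code:python}
> import pyarrow.parquet as pq
> 
> CHUNK_LIMIT = 1 << 24  # mirrors the hardcoded kBinaryChunksize (16MB)
> 
> def safe_row_group_size(table, column_name):
>     # Largest row count whose column chunk stays under the 16MB limit,
>     # estimated from the average bytes per row of the dominant column.
>     bytes_per_row = table.column(column_name).nbytes / max(table.num_rows, 1)
>     return max(1, int(CHUNK_LIMIT // max(bytes_per_row, 1)))
> 
> # pq.write_table(table, "out.parquet",
> #                row_group_size=safe_row_group_size(table, "payload"))
> {code}
>  
> That of course defeats the very tuning we are after, hence the questions 
> above.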



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
