[ https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598038#comment-17598038 ]

Micah Kornfield commented on ARROW-17459:
-----------------------------------------

1.  ChunkedArrays have a Flatten method that will do this, but I don't think it 
will help in this case.  IIRC, the challenge here is that Parquet only yields 
chunked arrays when the underlying column data cannot fit into the right Arrow 
structure.  For Utf8 arrays that means the sum of bytes across all strings has 
to be less than INT_MAX to fit in a single array; otherwise the data would need 
to flatten to LargeUtf8, which has implications for schema conversion.  Structs 
and lists always expect Arrays as their inner element types, not chunked 
arrays.
2.  This doesn't necessarily seem like the right approach.
3.  Per 1, this isn't really the issue, I think.  The approach that could work 
here (I don't remember all the code paths) is to vary the number of rows read 
back if not all rows are huge (see the sketch after this list).

One way forward here could be to add an option for reading back arrays to 
always use the Large* variant (or maybe on a per-column basis) to avoid 
chunking.
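
That option doesn't exist today; as a loose illustration of why the Large* 
variants avoid chunking, the sketch below casts a stand-in chunked utf8 column 
to large_utf8 (int64 offsets instead of int32) and concatenates the chunks 
into a single Array:

{code:cpp}
// Sketch: Utf8 arrays use int32 value offsets, so a single Array tops out at
// INT32_MAX bytes of character data; LargeUtf8 uses int64 offsets. Casting a
// chunked utf8 column to large_utf8 makes it safe to concatenate the chunks
// into one Array. The tiny two-chunk column below is purely a stand-in.
#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

#include <iostream>
#include <memory>

arrow::Status Demo() {
  arrow::StringBuilder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({"x", "y"}));
  ARROW_ASSIGN_OR_RAISE(auto chunk, builder.Finish());

  auto chunked = std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{chunk, chunk});  // stand-in for a multi-chunk column

  // Cast every chunk to large_utf8 (int64 offsets)...
  ARROW_ASSIGN_OR_RAISE(auto casted,
                        arrow::compute::Cast(chunked, arrow::large_utf8()));

  // ...then concatenation cannot overflow the offset type.
  ARROW_ASSIGN_OR_RAISE(auto single,
                        arrow::Concatenate(casted.chunked_array()->chunks()));
  std::cout << single->type()->ToString() << ", length " << single->length()
            << "\n";
  return arrow::Status::OK();
}

int main() {
  auto st = Demo();
  if (!st.ok()) {
    std::cerr << st.ToString() << "\n";
    return 1;
  }
  return 0;
}
{code}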

> [C++] Support nested data conversions for chunked array
> -------------------------------------------------------
>
>                 Key: ARROW-17459
>                 URL: https://issues.apache.org/jira/browse/ARROW-17459
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Arthur Passos
>            Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle|https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95]
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958


