[jira] [Commented] (NIFI-8154) AvroParquetHDFSRecordReader fails to read parquet file containing nested structs

Glenn Jones (Jira) Wed, 20 Jan 2021 13:00:05 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268869#comment-17268869
 ]


Glenn Jones commented on NIFI-8154:
-----------------------------------

The test fails because it expects a field in the Record produced by 
ConvertAvroToParquet to be named "map", but it is actually named "key_value".

In parquet-avro 1.10.0, AvroParquetWriter produces parquet with a schema that 
includes the following definition for the mymap field from the test avro:

required group mymap (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (UTF8);
    required int32 value;
  }
}

This doesn't conform to the Map logical type, which names the middle level 
"key_value", but it is allowed by the [backward compatibility 
rules|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1].

In parquet-avro 1.11.1, AvroParquetWriter produces the following, which I think 
is more correct (the middle level is named "key_value" instead of "map"):
 
required group mymap (MAP) {
  repeated group key_value (MAP_KEY_VALUE) {
    required binary key (STRING);
    required int32 value;
  }
}

The test uses GroupReadSupport to read the parquet into something it can 
examine, and as a result the middle-level group name it sees has changed from 
"map" to "key_value".  I doubt that other ReadSupport implementations would 
expose the name of the middle-level group in this way, so perhaps this wouldn't 
have been an issue if the tests had used AvroReadSupport.  In any case, I think 
it's fine to simply update the tests to expect the field names from the 1.11.1 
AvroParquetWriter.
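
To illustrate why the middle-level name need not matter to a reader, here is a 
rough sketch in Python.  It does not use the parquet-mr API: SchemaNode, 
middle_level and map_schema are hypothetical names made up for this example, 
and the two schemas are hand-built copies of the definitions above.  The idea 
is to find the repeated group inside a MAP group by position rather than by 
name, so the same check passes for both writer versions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaNode:
    """Toy stand-in for a Parquet schema node (hypothetical, not parquet-mr)."""
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    annotation: Optional[str] = None     # e.g. "MAP", "MAP_KEY_VALUE", "UTF8"
    children: List["SchemaNode"] = field(default_factory=list)


def middle_level(map_group: SchemaNode) -> SchemaNode:
    """Return the repeated key/value group of a MAP group, ignoring its name."""
    if map_group.annotation != "MAP":
        raise ValueError(f"{map_group.name} is not annotated as MAP")
    if len(map_group.children) != 1:
        raise ValueError(f"{map_group.name} must contain exactly one child")
    child = map_group.children[0]
    if child.repetition != "repeated":
        raise ValueError(f"child of {map_group.name} must be repeated")
    return child


def map_schema(middle_name: str, key_annotation: str) -> SchemaNode:
    """Build the mymap definition with the given middle-level group name."""
    return SchemaNode("mymap", "required", "MAP", [
        SchemaNode(middle_name, "repeated", "MAP_KEY_VALUE", [
            SchemaNode("key", "required", key_annotation),
            SchemaNode("value", "required"),
        ]),
    ])


# Hand-built copies of the two schemas quoted in this comment.
schema_1_10_0 = map_schema("map", "UTF8")
schema_1_11_1 = map_schema("key_value", "STRING")

for schema in (schema_1_10_0, schema_1_11_1):
    kv = middle_level(schema)
    print(kv.name, [child.name for child in kv.children])
```

Both schemas yield a middle-level group with the fields "key" and "value"; only 
the group's own name differs, which is the one thing the current test asserts 
on.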

> AvroParquetHDFSRecordReader fails to read parquet file containing nested 
> structs
> --------------------------------------------------------------------------------
>
>                 Key: NIFI-8154
>                 URL: https://issues.apache.org/jira/browse/NIFI-8154
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.11.3, 1.12.1
>            Reporter: Glenn Jones
>            Priority: Minor
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> FetchParquet can't be used to process files containing nested structs.  When 
> trying to create a RecordSchema it runs into 
> https://issues.apache.org/jira/browse/PARQUET-1441, which causes it to fail.  
> We've patched this locally by building the nifi-parquet-processors with 
> parquet-avro 1.11.0, but it would be great if this made it into the next 
> release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
