[ 
https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144855#comment-17144855
 ] 

ASF GitHub Bot commented on PARQUET-1879:
-----------------------------------------

maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-649482253


   > Thank you for creating the backward compatibility test for Map. It should have existed already.
   > Unfortunately, this approach does not properly test backward compatibility. The problem is that you cannot generate an "old" file with the "new" library. To be more precise, the message parser is more of a convenience and is not used while reading/writing a Parquet file. Saying that you are testing the converted type is not really accurate, because the parser tries to read logical types in the first place. Also, the Parquet writer writes both logical types and converted types, so you cannot validate old files that have only converted types.
   > I would suggest adding tests that cover the examples in the [spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1) by creating the Thrift-generated format objects and converting them with `ParquetMetadataConverter`, just as you did in `TestParquetMetadataConverter.testMapLogicalType`. These tests might fit better in that class as well.
   > I should have described this before you implemented this test. I am sorry about that.
   > 
   > Please don't force push your changes, because it makes the review harder to track. The committer will squash the PR before merging it anyway.
   
   
   
   Apologies for the force push. Good to know that it will be squashed on commit.
   
   And thanks for the detailed reply. 
   
   I think I got it this time :) 
   
   Tests were moved into `TestParquetMetadataConverter`, and for the old-format 
test the metadata is built through Thrift `SchemaElement` objects.
   
   Regards,
   Matt
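
For readers following the thread, the test pattern described above (building old-style Thrift metadata and checking how it converts) can be sketched as a self-contained toy. All class and method names below are stand-ins, not the real parquet-mr/parquet-format API; the actual test builds `org.apache.parquet.format.SchemaElement` objects and runs them through `ParquetMetadataConverter`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for parquet-format's Thrift SchemaElement: name, legacy
// converted type (or null), and number of children (0 for leaves).
class SchemaElement {
    final String name;
    final String convertedType;
    final int numChildren;

    SchemaElement(String name, String convertedType, int numChildren) {
        this.name = name;
        this.convertedType = convertedType;
        this.numChildren = numChildren;
    }
}

public class OldFormatMapTest {
    // Builds the flattened element list an "old" writer would produce for
    //   required group my_map (MAP) {
    //     repeated group map (MAP_KEY_VALUE) { key; value; }
    //   }
    // Note: converted types only, no logical types - that is what makes it
    // an "old" file and exercises the backward-compatibility path.
    static List<SchemaElement> oldFormatMap() {
        List<SchemaElement> elements = new ArrayList<>();
        elements.add(new SchemaElement("my_map", "MAP", 1));
        elements.add(new SchemaElement("map", "MAP_KEY_VALUE", 2));
        elements.add(new SchemaElement("key", "UTF8", 0));
        elements.add(new SchemaElement("value", null, 0));
        return elements;
    }

    // Stand-in for the converter under test: the outer group's legacy MAP
    // converted type must come back as the modern MAP logical type.
    static String logicalTypeOfRoot(List<SchemaElement> schema) {
        return "MAP".equals(schema.get(0).convertedType) ? "MAP" : "UNKNOWN";
    }

    public static void main(String[] args) {
        if (!"MAP".equals(logicalTypeOfRoot(oldFormatMap()))) {
            throw new AssertionError("old-format map not read as MAP");
        }
        System.out.println("ok");
    }
}
```

The point of the pattern is that the element list mimics what an old writer put on disk, so the reader-side conversion is what gets exercised, rather than round-tripping through a new writer that emits both annotations.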


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Apache Arrow cannot read a Parquet file written with Parquet-Avro 1.11.0 with 
> a Map field
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1879
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1879
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro, parquet-format
>    Affects Versions: 1.11.0
>            Reporter: Matthew McMahon
>            Priority: Critical
>
> From my 
> [StackOverflow question|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with],
>  regarding an issue I'm having getting Snowflake (a cloud DB) to load 
> Parquet files written with version 1.11.0.
> ----
> The problem only appears when using a map schema field in the Avro schema. 
> For example:
> {code:java}
>     {
>       "name": "FeatureAmounts",
>       "type": {
>         "type": "map",
>         "values": "records.MoneyDecimal"
>       }
>     }
> {code}
> When using Parquet-Avro to write the file, a bad Parquet schema is produced, 
> for example:
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
>     repeated group map (MAP_KEY_VALUE) {
>       required binary key (STRING);
>       required fixed_len_byte_array(12) value (DECIMAL(28,15));
>     }
>   }
> }
> {code}
> From the great answer to my StackOverflow, it seems the issue is that the 
> 1.11.0 Parquet-Avro is still using the legacy MAP_KEY_VALUE converted type, 
> that has no logical type equivalent. From the comment on 
> [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]
> {code:java}
> // This logical type annotation is implemented to support backward compatibility with ConvertedType.
> // The new logical type representation in parquet-format doesn't have any key-value type,
> // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this is being written with the latest 1.11.0, which then 
> causes Apache Arrow to fail with
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> It appears that 
> [Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632]
>  only looks for the new logical types Map and List, which is what causes this 
> error.
> I have seen in the parquet-format 
> [LogicalTypes|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
>  spec that the schema should be something like
> {code:java}
> // Map<String, Integer>
> required group my_map (MAP) {
>   repeated group key_value {
>     required binary key (UTF8);
>     optional int32 value;
>   }
> }
> {code}
> Is this on the correct path?
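
For context on the compatibility rule involved here: per the parquet-format backward-compatibility rules, a group annotated with the legacy MAP_KEY_VALUE converted type that is not already contained in a MAP-annotated group should be read as a map itself, not surfaced as a null/unknown logical type (the failure Arrow hits above). A minimal sketch of that mapping; the enum and method names are illustrative, not the parquet-mr API:

```java
import java.util.Optional;

// Illustrative enums; parquet-mr's real types live in org.apache.parquet.schema.
enum ConvertedType { MAP, MAP_KEY_VALUE, LIST, UTF8 }
enum LogicalType { MAP, LIST, STRING }

public class LegacyMapMapping {
    // Maps a legacy converted type to the modern logical type, following the
    // parquet-format backward-compatibility rules: MAP_KEY_VALUE on a group
    // that is NOT inside a MAP-annotated group should be treated as MAP.
    static Optional<LogicalType> toLogicalType(ConvertedType ct, boolean insideMap) {
        switch (ct) {
            case MAP:
                return Optional.of(LogicalType.MAP);
            case MAP_KEY_VALUE:
                // Inside a MAP group the annotation is redundant and ignored;
                // standalone, it means the group itself is a map.
                return insideMap ? Optional.empty() : Optional.of(LogicalType.MAP);
            case LIST:
                return Optional.of(LogicalType.LIST);
            case UTF8:
                return Optional.of(LogicalType.STRING);
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toLogicalType(ConvertedType.MAP_KEY_VALUE, false)); // Optional[MAP]
    }
}
```

Under this rule a reader never sees MAP_KEY_VALUE as a top-level annotation with no logical-type equivalent, which is what "Logical type Null can not be applied to group node" points at.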



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
