Matthew McMahon created PARQUET-1879:
----------------------------------------

             Summary: Apache Arrow can not read a Parquet File written with 
Parquet-Avro 1.11.0 with a Map field
                 Key: PARQUET-1879
                 URL: https://issues.apache.org/jira/browse/PARQUET-1879
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.11.0
            Reporter: Matthew McMahon


This follows from my 
[StackOverflow question|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with]
 about getting Snowflake (Cloud DB) to load Parquet files written with 
version 1.11.0.

----

The problem only appears when using a map schema field in the Avro schema. For 
example:

{code}
    {
      "name": "FeatureAmounts",
      "type": {
        "type": "map",
        "values": "records.MoneyDecimal"
      }
    }
{code}

When Parquet-Avro writes the file, it produces a bad Parquet schema, for 
example:

{code}
message record.ResponseRecord {
  required binary GroupId (STRING);
  required int64 EntryTime (TIMESTAMP(MILLIS,true));
  required int64 HandlingDuration;
  required binary Id (STRING);
  optional binary ResponseId (STRING);
  required binary RequestId (STRING);
  optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
  required group FeatureAmounts (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      required fixed_len_byte_array(12) value (DECIMAL(28,15));
    }
  }
}
{code}

From the great answer to my StackOverflow question, it seems the issue is that 
Parquet-Avro 1.11.0 still uses the legacy MAP_KEY_VALUE converted type, 
which has no logical type equivalent. From the comment on 
[LogicalTypeAnnotation](https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904):

{code}
// This logical type annotation is implemented to support backward compatibility with ConvertedType.
// The new logical type representation in parquet-format doesn't have any key-value type,
// thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
{code}

However, it seems this type is still being written by the latest 1.11.0, which 
then causes Apache Arrow to fail with:

{code}
Logical type Null can not be applied to group node
{code}

It appears that 
[Arrow](https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632)
 only looks for the new logical type of Map or List, so the legacy annotation 
causes this error.
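
To check whether a given file is likely affected, one option is to scan a textual schema dump (like the one shown earlier) for the legacy annotation. The following is a minimal Python sketch using only the standard library. `find_legacy_map_groups` is a hypothetical helper name, and the sketch assumes the schema is available as text in the layout shown above; it does not read the Parquet file itself and is not how Arrow or parquet-mr inspect schemas.

```python
import re

# Matches a repeated group carrying the legacy MAP_KEY_VALUE annotation,
# e.g. "repeated group map (MAP_KEY_VALUE) {".
LEGACY_MAP_GROUP = re.compile(
    r"^\s*repeated\s+group\s+(\w+)\s+\(MAP_KEY_VALUE\)\s*\{"
)


def find_legacy_map_groups(schema_text):
    """Return (line_number, group_name) for each legacy-annotated group.

    Hypothetical helper: works on a text dump of the schema only.
    """
    hits = []
    for lineno, line in enumerate(schema_text.splitlines(), start=1):
        match = LEGACY_MAP_GROUP.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


# Abbreviated copy of the schema from this report.
SCHEMA = """\
message record.ResponseRecord {
  required binary GroupId (STRING);
  required group FeatureAmounts (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      required fixed_len_byte_array(12) value (DECIMAL(28,15));
    }
  }
}
"""

if __name__ == "__main__":
    for lineno, name in find_legacy_map_groups(SCHEMA):
        print(f"line {lineno}: repeated group '{name}' uses MAP_KEY_VALUE")
```

A schema whose map fields use a plain `repeated group key_value` would produce no hits.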

I have seen in the parquet-format 
[LogicalTypes](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
 document that the schema should be something like:

{code}
// Map<String, Integer>
required group my_map (MAP) {
  repeated group key_value {
    required binary key (UTF8);
    optional int32 value;
  }
}
{code}
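
To make the difference between the written schema and the documented layout concrete, here is a small text-level sketch in Python. It rewrites a legacy `repeated group <name> (MAP_KEY_VALUE)` line in a schema dump into the `repeated group key_value` form shown above. `to_documented_map_layout` is a hypothetical name, and this only transforms a printed schema for comparison; it does not repair an existing Parquet file, and it is not a description of how Parquet-Avro should be changed.

```python
import re

# Captures the "repeated group " prefix and the opening brace so that only
# the group name and the legacy annotation are replaced.
LEGACY_MAP_GROUP = re.compile(
    r"^(\s*repeated\s+group\s+)\w+\s+\(MAP_KEY_VALUE\)(\s*\{)",
    re.MULTILINE,
)


def to_documented_map_layout(schema_text):
    """Rewrite legacy map groups in a schema dump to the documented form.

    Hypothetical helper for illustration: text in, text out.
    """
    return LEGACY_MAP_GROUP.sub(r"\1key_value\2", schema_text)


# The map field from this report, as written by Parquet-Avro 1.11.0.
WRITTEN = """\
required group FeatureAmounts (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (STRING);
    required fixed_len_byte_array(12) value (DECIMAL(28,15));
  }
}
"""

if __name__ == "__main__":
    print(to_documented_map_layout(WRITTEN))
```

Only the repeated group line changes; the outer `(MAP)` annotation and the key and value fields are left untouched.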

Is this on the correct path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
