[ https://issues.apache.org/jira/browse/PARQUET-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Lacelle updated PARQUET-1202:
------------------------------------
    Description: 
Hello,

While reading back a Parquet file produced with Spark, it appears the schema
produced by parquet-avro is not valid.

Consider the following simple piece of code:

{code}
ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri())).build();

System.out.println(reader.read().getSchema());
{code}

I get a stack trace like:

{code}
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
       at org.apache.avro.Schema$Names.put(Schema.java:1128)
       at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
       at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
       at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
       at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
       at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
       at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
       at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
       at org.apache.avro.Schema.toString(Schema.java:324)
       at org.apache.avro.Schema.toString(Schema.java:314)
{code}
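The failure happens because Avro requires every named schema (record, enum, fixed) to have a unique full name within a schema document, and both map value records are locally named "value" with no namespace. A minimal sketch of that uniqueness check, assuming a plain `HashMap`-based model (the class below is hypothetical; Avro's real check lives in `org.apache.avro.Schema$Names`):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of Avro's uniqueness check for named schemas.
// This only illustrates why two records both named "value" collide;
// it is not Avro's actual implementation.
public class NamesSketch {
    private final Map<String, String> names = new HashMap<>();

    // Registers a record's full name; throws like Avro's SchemaParseException
    // when the same full name is registered twice.
    public void put(String fullName, String definition) {
        if (names.containsKey(fullName)) {
            throw new RuntimeException("Can't redefine: " + fullName);
        }
        names.put(fullName, definition);
    }

    public static void main(String[] args) {
        NamesSketch names = new NamesSketch();
        // Both MAP fields produce a nested record simply called "value",
        // with no namespace, so their full names are identical.
        names.put("value", "record for calculatedobjectinfomap");
        try {
            names.put("value", "record for riskfactorinfomap");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints: Can't redefine: value
        }
    }
}
```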

 

The issue seems the same as the one reported in:

[https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name]

 

It has been fixed in spark-avro in:

[https://github.com/databricks/spark-avro/pull/73]

In our case, the parquet schema looks like:

{code}
message spark_schema {
        optional group calculatedobjectinfomap (MAP) {
                repeated group key_value {
                        required binary key (UTF8);
                        optional group value {
                                optional int64 calcobjid;
                                optional int64 calcobjparentid;
                                optional binary portfolioname (UTF8);
                                optional binary portfolioscheme (UTF8);
                                optional binary calcobjtype (UTF8);
                                optional binary calcobjmnemonic (UTF8);
                                optional binary calcobinstrumentype (UTF8);
                                optional int64 calcobjectqty;
                                optional binary calcobjboid (UTF8);
                                optional binary analyticalfoldermnemonic (UTF8);
                                optional binary calculatedidentifier (UTF8);
                                optional binary calcobjlevel (UTF8);
                                optional binary calcobjboidscheme (UTF8);
                        }
                }
        }
        optional group riskfactorinfomap (MAP) {
                repeated group key_value {
                        required binary key (UTF8);
                        optional group value {
                                optional binary riskfactorname (UTF8);
                                optional binary riskfactortype (UTF8);
                                optional binary riskfactorrole (UTF8);
                        }
                }
        }
}
{code}
We indeed have two MAP fields whose value records are both named 'value'. The name 'value' is the default used by org.apache.spark.sql.types.MapType.

The fix does not seem trivial given the current parquet-avro code, so I doubt I will be able to craft a valid PR without guidance.
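One possible direction, mirroring what the spark-avro pull request above did, is to derive a namespace for each nested record from the chain of parent field names, so the two records locally named "value" end up with distinct full names. A hedged sketch of that naming scheme, assuming a hypothetical helper (this is not parquet-avro API):

```java
// Hypothetical sketch of path-based disambiguation: build a full name
// for each nested record from the parent field names, so two records
// both locally named "value" no longer collide in Avro's name registry.
public class RecordNaming {
    // Builds a full name like "spark_schema.calculatedobjectinfomap.value".
    static String fullName(String[] parentPath, String localName) {
        StringBuilder ns = new StringBuilder();
        for (String p : parentPath) {
            if (ns.length() > 0) ns.append('.');
            ns.append(p);
        }
        return ns.length() == 0 ? localName : ns + "." + localName;
    }

    public static void main(String[] args) {
        String a = fullName(new String[] {"spark_schema", "calculatedobjectinfomap"}, "value");
        String b = fullName(new String[] {"spark_schema", "riskfactorinfomap"}, "value");
        System.out.println(a); // spark_schema.calculatedobjectinfomap.value
        System.out.println(b); // spark_schema.riskfactorinfomap.value
    }
}
```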


Thanks,


> Add differentiation of nested records with the same name
> --------------------------------------------------------
>
>                 Key: PARQUET-1202
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1202
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.7.0, 1.8.2
>            Reporter: Benoit Lacelle
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
