[
https://issues.apache.org/jira/browse/PARQUET-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benoit Lacelle updated PARQUET-1202:
------------------------------------
Description:
Hello,
When reading back a Parquet file produced with Spark, the Avro schema produced by parquet-avro appears to be invalid.
Consider the following simple piece of code:
{code}
ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new org.apache.hadoop.fs.Path(path.toUri()))
        .build();
System.out.println(reader.read().getSchema());
{code}
I get a stack trace like:
{code}
Exception in thread "main" org.apache.avro.SchemaParseException: Can't redefine: value
	at org.apache.avro.Schema$Names.put(Schema.java:1128)
	at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
	at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
	at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
	at org.apache.avro.Schema$MapSchema.toJson(Schema.java:833)
	at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
	at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
	at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
	at org.apache.avro.Schema.toString(Schema.java:324)
	at org.apache.avro.Schema.toString(Schema.java:314)
{code}
The issue seems to be the same as the one reported in:
[https://www.bountysource.com/issues/22823013-spark-avro-fails-to-save-df-with-nested-records-having-the-same-name]
It has been fixed in spark-avro in:
[https://github.com/databricks/spark-avro/pull/73]
In our case, the Parquet schema looks like:
{code}
message spark_schema {
  optional group calculatedobjectinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional int64 calcobjid;
        optional int64 calcobjparentid;
        optional binary portfolioname (UTF8);
        optional binary portfolioscheme (UTF8);
        optional binary calcobjtype (UTF8);
        optional binary calcobjmnemonic (UTF8);
        optional binary calcobinstrumentype (UTF8);
        optional int64 calcobjectqty;
        optional binary calcobjboid (UTF8);
        optional binary analyticalfoldermnemonic (UTF8);
        optional binary calculatedidentifier (UTF8);
        optional binary calcobjlevel (UTF8);
        optional binary calcobjboidscheme (UTF8);
      }
    }
  }
  optional group riskfactorinfomap (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional group value {
        optional binary riskfactorname (UTF8);
        optional binary riskfactortype (UTF8);
        optional binary riskfactorrole (UTF8);
      }
    }
  }
}
{code}
We indeed have two MAP fields whose value group is named 'value'. The name 'value'
is the default in org.apache.spark.sql.types.MapType, so both maps produce an Avro
record with the same full name.
The fix does not seem trivial given the current parquet-avro code, so I doubt I will
be able to craft a valid PR without directions.
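To illustrate the clash, here is a minimal sketch in plain Java (deliberately not using the real Avro classes; the class and method names below are hypothetical) of the rule enforced by Avro's Schema$Names registry: a full name may only be defined once, so two distinct records both named 'value' collide, while qualifying each record with a namespace derived from its enclosing field (the approach taken in the spark-avro PR above, as I understand it) keeps the full names distinct.
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Avro's name-uniqueness rule, not the real Schema$Names class.
public class NameRegistrySketch {

    // Mirrors the behaviour of Schema$Names.put: registering a second,
    // different definition under an already-used full name is rejected.
    static void define(Map<String, String> names, String fullName, String definition) {
        String existing = names.get(fullName);
        if (existing != null && !existing.equals(definition)) {
            throw new IllegalStateException("Can't redefine: " + fullName);
        }
        names.put(fullName, definition);
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<>();
        // Both map values are records named "value": same full name, clash.
        define(names, "value", "{calcobjid, ...}");
        try {
            define(names, "value", "{riskfactorname, ...}");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // Namespacing by the enclosing field gives each record a distinct full name.
        define(names, "calculatedobjectinfomap.value", "{calcobjid, ...}");
        define(names, "riskfactorinfomap.value", "{riskfactorname, ...}");
        System.out.println("namespaced records registered without conflict");
    }
}
{code}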
Thanks,
> Add differentiation of nested records with the same name
> --------------------------------------------------------
>
> Key: PARQUET-1202
> URL: https://issues.apache.org/jira/browse/PARQUET-1202
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.7.0, 1.8.2
> Reporter: Benoit Lacelle
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)