RussellSpitzer opened a new pull request #2167:
URL: https://github.com/apache/iceberg/pull/2167
Previously, the Iceberg conversion functions for Parquet would throw an exception when they encountered a binary list field that was internally represented as a repeated primitive field not nested in another group type. This single-level encoding violated some expectations within our schema conversion code.
We encountered this with a user who was using Parquet's `AvroParquetWriter` class to write Parquet files. The resulting files, while readable by Hive and Spark, were not readable by Iceberg.
Investigating this, I found that the following Avro schema element caused the problem:
```java
String schema = "{\n" +
" \"type\":\"record\",\n" +
" \"name\":\"DbRecord\",\n" +
" \"namespace\":\"com.russ\",\n" +
" \"fields\":[\n" +
" {\n" +
" \"name\":\"foo\",\n" +
" \"type\":[\n" +
" \"null\",\n" +
" {\n" +
" \"type\":\"array\",\n" +
" \"items\":\"bytes\"\n" +
" }\n" +
" ],\n" +
" \"default\":null\n" +
" }\n" +
" ]\n" +
"}";
```
Parquet converts this element into:
```
foo:
OPTIONAL F:1
.array: REPEATED BINARY R:1 D:2
```
This violates an assumption in Iceberg's reader, which expects list elements to be nested in a group rather than encoded as a bare repeated primitive.
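For comparison, the standard three-level list encoding of the same field would look roughly like this (a sketch following the parquet-format LIST specification; the exact repetition/definition levels and the printed form are illustrative):

```
foo:
  OPTIONAL F:1
  .list: REPEATED F:1
  ..element: OPTIONAL BINARY R:1 D:3
```

Here the repeated field is a `list` group wrapping an `element` field, which is the shape Iceberg's reader assumed was always present.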
Doing a quick test with
```java
org.apache.avro.Schema.Parser parser = new org.apache.avro.Schema.Parser();
org.apache.avro.Schema avroSchema = parser.parse(schema);
AvroSchemaConverter converter = new AvroSchemaConverter();
MessageType parquetSchema = converter.convert(avroSchema);
```
I confirmed that this is reproducible with the current version of Parquet, not just with our user's code.
To fix this, I added tests for this particular datatype and loosened some of the restrictions in our Parquet schema parsing code.
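The loosened behavior follows the backward-compatibility guidance in the parquet-format spec for lists: if the repeated field directly under a LIST-annotated group is a primitive, it is itself the element type rather than a wrapper group. A minimal, self-contained sketch of that decision rule, using a toy `Field` model (not Iceberg's or Parquet's actual classes):

```java
import java.util.List;

public class ListCompat {
    // Toy stand-in for a Parquet schema node: a name, whether it is a
    // group, and its child fields (empty for primitives).
    public record Field(String name, boolean isGroup, List<Field> children) {}

    // Sketch of the parquet-format backward-compatibility rule: the
    // repeated field nested in a LIST group is the element itself when
    // it is a primitive, a multi-field group, or a legacy-named group.
    public static boolean isElementType(Field repeated) {
        if (!repeated.isGroup()) {
            return true; // e.g. `repeated binary array`: the element itself
        }
        if (repeated.children().size() > 1) {
            return true; // repeated group with several fields: a struct element
        }
        // Legacy single-level names produced by older writers.
        return repeated.name().equals("array")
            || repeated.name().endsWith("_tuple");
    }

    public static void main(String[] args) {
        // `repeated binary array`, as produced by AvroParquetWriter above.
        Field legacy = new Field("array", false, List.of());
        // Standard wrapper: `repeated group list { optional binary element; }`.
        Field modern = new Field("list", true,
            List.of(new Field("element", false, List.of())));
        System.out.println(isElementType(legacy)); // true: read it as the element
        System.out.println(isElementType(modern)); // false: descend into the wrapper
    }
}
```

With a check along these lines, the `repeated BINARY array` field in the schema dump above is treated as a list of binary values instead of triggering an exception.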