openinx opened a new issue #1578: URL: https://github.com/apache/iceberg/issues/1578
While writing a few unit tests for https://github.com/apache/iceberg/pull/1477/files, I found that the encode/decode round trip would not pass because of an `AvroSchemaUtil` conversion issue. The test is easy to understand:

```java
package org.apache.iceberg.avro;

import java.io.IOException;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.junit.Assert;

public class TestAvroEncoderUtil extends AvroDataTest {

  @Override
  protected void writeAndValidate(org.apache.iceberg.Schema schema) throws IOException {
    List<GenericData.Record> expected = RandomAvroData.generate(schema, 100, 1990L);
    Schema avroSchema = AvroSchemaUtil.convert(schema.asStruct());

    for (GenericData.Record record : expected) {
      // Encode, decode, then re-encode; the two serialized forms should be identical.
      byte[] serializedData = AvroEncoderUtil.encode(record, avroSchema);
      GenericData.Record expectedRecord = AvroEncoderUtil.decode(serializedData);
      byte[] serializedData2 = AvroEncoderUtil.encode(expectedRecord, avroSchema);
      Assert.assertArrayEquals(serializedData2, serializedData);
    }
  }
}
```

After digging into this issue, I found that the cause is [here](https://github.com/apache/iceberg/commit/d8cecc411daf16955963766fa6336d4260e7c797#diff-192650b1711edcd50a73986ec880528cR144). For example, if we convert this simple Iceberg schema to an Avro schema:

```java
Schema schema = new Schema(
    required(0, "id", Types.LongType.get()),
    optional(1, "data", Types.MapType.ofOptional(2, 3, Types.LongType.get(), Types.StringType.get())));
org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schema.asStruct());
System.out.println(avroSchema.toString(true));
```

we get:

```json
{
  "type" : "record",
  "name" : "rnull",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "field-id" : 0
  }, {
    "name" : "data",
    "type" : [ "null", {
      "type" : "array",   // <- it adds an array here, which is quite confusing
      "items" : {
        "type" : "record",
        "name" : "k2_v3",
        "fields" : [ {
          "name" : "key",
          "type" : "long",
          "field-id" : 2
        }, {
          "name" : "value",
          "type" : [ "null", "string" ],
          "default" : null,
          "field-id" : 3
        } ]
      },
      "logicalType" : "map"
    } ],
    "default" : null,
    "field-id" : 1
  } ]
}
```

In my understanding, it should be this JSON instead:

```json
{
  "type" : "record",
  "name" : "rnull",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "field-id" : 0
  }, {
    "name" : "data",
    "type" : [ "null", {
      "type" : "record",
      "name" : "k2_v3",
      "fields" : [ {
        "name" : "key",
        "type" : "long",
        "field-id" : 2
      }, {
        "name" : "value",
        "type" : [ "null", "string" ],
        "default" : null,
        "field-id" : 3
      } ],
      "logicalType" : "map"
    } ],
    "default" : null,
    "field-id" : 1
  } ]
}
```

What is the reason for converting it this way? I don't quite understand the message of commit https://github.com/apache/iceberg/commit/d8cecc411daf16955963766fa6336d4260e7c797 either. @rdblue
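For comparison, here is a hypothetical variant of the snippet above with a string-keyed map (my own example, not from the PR). Since Avro's native map type only supports string keys, I would expect this one to convert to a plain Avro `map`, which the long-keyed map above apparently cannot use:

```java
// Same schema as above, but with STRING map keys instead of LONG. Avro's
// built-in map type only allows string keys, so this variant can convert to
// a native Avro map; the long-keyed version above has no native equivalent.
Schema stringKeyed = new Schema(
    required(0, "id", Types.LongType.get()),
    optional(1, "data", Types.MapType.ofOptional(2, 3, Types.StringType.get(), Types.StringType.get())));
System.out.println(AvroSchemaUtil.convert(stringKeyed.asStruct()).toString(true));
```

If that prints a plain `{"type" : "map", "values" : [...]}` union member, it would suggest the array-of-key/value-records form is specifically a workaround for non-string keys.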

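For readers without the PR checked out: `AvroEncoderUtil.decode` in the test above takes only the bytes, so the schema must travel with the payload. Below is a minimal sketch of such a schema-embedding round trip using only standard Avro container-file APIs; it is my own illustration of the idea, not the actual implementation in PR #1477.

```java
package org.apache.iceberg.avro;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;

// Hypothetical sketch of a schema-embedding encode/decode round trip,
// similar in spirit to AvroEncoderUtil but not the real implementation.
public class AvroRoundTripSketch {

  // Serialize a record together with its schema using Avro's container
  // format, so that decode() needs no separate schema argument.
  static byte[] encode(GenericData.Record record, Schema schema) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DataFileWriter<GenericData.Record> writer =
        new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
      writer.create(schema, out);
      writer.append(record);
    }
    return out.toByteArray();
  }

  // Read the record back; the schema is recovered from the container header.
  static GenericData.Record decode(byte[] data) throws IOException {
    try (DataFileStream<GenericData.Record> reader =
        new DataFileStream<>(new ByteArrayInputStream(data), new GenericDatumReader<>())) {
      return reader.next();
    }
  }
}
```

With helpers shaped like these, the encode → decode → encode assertion in the test is byte-for-byte stable as long as the schema conversion itself round-trips cleanly.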