[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434550078 ## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java ## @@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception { testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema); } + @Test + public void testUUIDType() throws Exception { +Schema fromAvro = Schema.createRecord("myrecord", null, null, false, +Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null))); +String parquet = "message myrecord {\n" + +" required binary uuid (STRING);\n" + +"}\n"; +Schema toAvro = Schema.createRecord("myrecord", null, null, false, +Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null))); + +testAvroToParquetConversion(fromAvro, parquet); +testParquetToAvroConversion(toAvro, parquet); + } Review comment: The [`testRoundTripConversion`](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L144) I'm using in `testUUIDTypeWithParquetUUID` is actually stronger than the one you suggested: it checks for equality (in two phases) of the initial and the result avro schemas (and not only for compatibility). For `testUUIDType`, though it is a good idea to check the compatibility of the avro schemas. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r433681475 ## File path: parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveStringifier.java ## @@ -309,6 +308,35 @@ public void testDecimalStringifier() { checkThrowingUnsupportedException(stringifier, Integer.TYPE, Long.TYPE, Binary.class); } + @Test + public void testUUIDStringifier() { +PrimitiveStringifier stringifier = PrimitiveStringifier.UUID_STRINGIFIER; + +assertEquals("00112233-4455-6677-8899-aabbccddeeff", stringifier.stringify( +toBinary(0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff))); +assertEquals("----", stringifier.stringify( +toBinary(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00))); +assertEquals("----", stringifier.stringify( +toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff))); + +assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify( Review comment: I'm happy to add tests for the edge cases like too short or too long inputs. Though, I would not implement additional validations because of performance issues. A `stringify` method would be invoked on each values; an additional check would highly impact performance even if it is only used from the tools and not really in production. A `Stringifier` is associated to the value at schema level which means it shall never happen that the value is invalid. That's why the `Stringifier` implementations do not validate the values. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r433678473 ## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java ## @@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception { testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema); } + @Test + public void testUUIDType() throws Exception { +Schema fromAvro = Schema.createRecord("myrecord", null, null, false, +Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null))); +String parquet = "message myrecord {\n" + +" required binary uuid (STRING);\n" + +"}\n"; +Schema toAvro = Schema.createRecord("myrecord", null, null, false, +Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null))); + +testAvroToParquetConversion(fromAvro, parquet); +testParquetToAvroConversion(toAvro, parquet); + } Review comment: I'll look into this just did not have time to work on this PR. Thanks a lot for reviewing. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421389575 ## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java ## @@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception { testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema); } + @Test + public void testUUIDType() throws Exception { +Schema fromAvro = Schema.createRecord("myrecord", null, null, false, +Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null))); +String parquet = "message myrecord {\n" + +" required binary uuid (STRING);\n" + +"}\n"; +Schema toAvro = Schema.createRecord("myrecord", null, null, false, +Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null))); + +testAvroToParquetConversion(fromAvro, parquet); +testParquetToAvroConversion(toAvro, parquet); + } Review comment: To be honest I am not too familiar with parquet-avro. I've made the changes based on the implementation/test of other logical types. Could you explain it in more details what you would test exactly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421384119 ## File path: parquet-avro/README.md ## @@ -0,0 +1,44 @@
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421381112 ## File path: parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveStringifier.java ## @@ -309,6 +308,35 @@ public void testDecimalStringifier() { checkThrowingUnsupportedException(stringifier, Integer.TYPE, Long.TYPE, Binary.class); } + @Test + public void testUUIDStringifier() { +PrimitiveStringifier stringifier = PrimitiveStringifier.UUID_STRINGIFIER; + +assertEquals("00112233-4455-6677-8899-aabbccddeeff", stringifier.stringify( +toBinary(0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff))); +assertEquals("----", stringifier.stringify( +toBinary(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00))); +assertEquals("----", stringifier.stringify( +toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff))); + +assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify( Review comment: The idea is to have 3 kind-of corner cases and 3 common (random but constant) values. What do you mean by duplicate coverage? (I think, we do not need exhaustive testing for the stringifiers because they only used by our tools for debugging purposes.) The stringifiers do not validate the data they get because of performance reasons. So, if the array is longer than 16 it would simply stringify the first 16 and skip the others. In case of the length is too short then an `ArrayIndexOutOfBoundsException` would be thrown. Do you think we should test these cases? They would not reach any additional branches in the parquet code. Invalid characters are not possible. The full set of values of the 16 bytes array is covered in UUID. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421372389 ## File path: parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveStringifier.java ## @@ -421,4 +422,30 @@ private String stringifyWithScale(BigInteger i) { } }; } + + static final PrimitiveStringifier UUID_STRINGIFIER = new PrimitiveStringifier("UUID_STRINGIFIER") { +private final char[] digit = "0123456789abcdef".toCharArray(); +@Override +public String stringify(Binary value) { + byte[] bytes = value.getBytesUnsafe(); Review comment: `final` would protect the reference only and not the values of the array. Making a reference final in a local scope is usually required in situations where it is accessed from e.g. lambda closures. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778: URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421368850 ## File path: parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java ## @@ -861,6 +871,36 @@ PrimitiveStringifier valueStringifier(PrimitiveType primitiveType) { } } + public static class UUIDLogicalTypeAnnotation extends LogicalTypeAnnotation { Review comment: As this is a singleton the default implementation of `equals(Object)` and `hashCode()` fits perfectly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org