[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-06-03 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434550078



##
File path: 
parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() 
throws Exception {
 testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+Arrays.asList(new Schema.Field("uuid", 
LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+String parquet = "message myrecord {\n" +
+"  required binary uuid (STRING);\n" +
+"}\n";
+Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, 
null)));
+
+testAvroToParquetConversion(fromAvro, parquet);
+testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   The 
[`testRoundTripConversion`](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L144)
 I'm using in `testUUIDTypeWithParquetUUID` is actually stronger than the one 
you suggested: it checks for equality (in two phases) of the initial and the 
result avro schemas (and not only for compatibility). For `testUUIDType`, 
though it is a good idea to check the compatibility of the avro schemas.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-06-02 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r433681475



##
File path: 
parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveStringifier.java
##
@@ -309,6 +308,35 @@ public void testDecimalStringifier() {
 checkThrowingUnsupportedException(stringifier, Integer.TYPE, Long.TYPE, 
Binary.class);
   }
 
+  @Test
+  public void testUUIDStringifier() {
+PrimitiveStringifier stringifier = PrimitiveStringifier.UUID_STRINGIFIER;
+
+assertEquals("00112233-4455-6677-8899-aabbccddeeff", stringifier.stringify(
+toBinary(0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 
0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff)));
+assertEquals("----", stringifier.stringify(
+toBinary(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, 0x00, 0x00)));
+assertEquals("----", stringifier.stringify(
+toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 
0xff, 0xff, 0xff, 0xff, 0xff, 0xff)));
+
+assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify(

Review comment:
   I'm happy to add tests for the edge cases like too short or too long 
inputs. Though, I would not implement additional validations because of 
performance issues. A `stringify` method would be invoked on each values; an 
additional check would highly impact performance even if it is only used from 
the tools and not really in production. A `Stringifier` is associated to the 
value at schema level which means it shall never happen that the value is 
invalid. That's why the `Stringifier` implementations do not validate the 
values.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-06-02 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r433678473



##
File path: 
parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() 
throws Exception {
 testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+Arrays.asList(new Schema.Field("uuid", 
LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+String parquet = "message myrecord {\n" +
+"  required binary uuid (STRING);\n" +
+"}\n";
+Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, 
null)));
+
+testAvroToParquetConversion(fromAvro, parquet);
+testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   I'll look into this just did not have time to work on this PR. Thanks a 
lot for reviewing. :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-05-07 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421389575



##
File path: 
parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() 
throws Exception {
 testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+Arrays.asList(new Schema.Field("uuid", 
LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+String parquet = "message myrecord {\n" +
+"  required binary uuid (STRING);\n" +
+"}\n";
+Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, 
null)));
+
+testAvroToParquetConversion(fromAvro, parquet);
+testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   To be honest I am not too familiar with parquet-avro. I've made the 
changes based on the implementation/test of other logical types. Could you 
explain it in more details what you would test exactly?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-05-07 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421384119



##
File path: parquet-avro/README.md
##
@@ -0,0 +1,44 @@

[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-05-07 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421381112



##
File path: 
parquet-column/src/test/java/org/apache/parquet/schema/TestPrimitiveStringifier.java
##
@@ -309,6 +308,35 @@ public void testDecimalStringifier() {
 checkThrowingUnsupportedException(stringifier, Integer.TYPE, Long.TYPE, 
Binary.class);
   }
 
+  @Test
+  public void testUUIDStringifier() {
+PrimitiveStringifier stringifier = PrimitiveStringifier.UUID_STRINGIFIER;
+
+assertEquals("00112233-4455-6677-8899-aabbccddeeff", stringifier.stringify(
+toBinary(0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 
0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff)));
+assertEquals("----", stringifier.stringify(
+toBinary(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, 0x00, 0x00)));
+assertEquals("----", stringifier.stringify(
+toBinary(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 
0xff, 0xff, 0xff, 0xff, 0xff, 0xff)));
+
+assertEquals("0eb1497c-19b6-42bc-b028-b4b612bed141", stringifier.stringify(

Review comment:
   The idea is to have 3 kind-of corner cases and 3 common (random but 
constant) values. What do you mean by duplicate coverage? (I think, we do not 
need exhaustive testing for the stringifiers because they only used by our 
tools for debugging purposes.)
   
   The stringifiers do not validate the data they get because of performance 
reasons. So, if the array is longer than 16 it would simply stringify the first 
16 and skip the others. In case of the length is too short then an 
`ArrayIndexOutOfBoundsException` would be thrown. Do you think we should test 
these cases? They would not reach any additional branches in the parquet code.
   Invalid characters are not possible. The full set of values of the 16 bytes 
array is covered in UUID.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-05-07 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421372389



##
File path: 
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveStringifier.java
##
@@ -421,4 +422,30 @@ private String stringifyWithScale(BigInteger i) {
   }
 };
   }
+
+  static final PrimitiveStringifier UUID_STRINGIFIER = new 
PrimitiveStringifier("UUID_STRINGIFIER") {
+private final char[] digit = "0123456789abcdef".toCharArray();
+@Override
+public String stringify(Binary value) {
+  byte[] bytes = value.getBytesUnsafe();

Review comment:
   `final` would protect the reference only and not the values of the 
array. Making a reference final in a local scope is usually required in 
situations where it is accessed from e.g. lambda closures.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-05-07 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r421368850



##
File path: 
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
##
@@ -861,6 +871,36 @@ PrimitiveStringifier valueStringifier(PrimitiveType 
primitiveType) {
 }
   }
 
+  public static class UUIDLogicalTypeAnnotation extends LogicalTypeAnnotation {

Review comment:
   As this is a singleton the default implementation of `equals(Object)` 
and `hashCode()` fits perfectly.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org