Re: [PR] AVRO-3918: add uuid with bytes and fixed [avro]

via GitHub Thu, 28 Dec 2023 09:30:15 -0800


Fokko commented on code in PR #2652:
URL: https://github.com/apache/avro/pull/2652#discussion_r1437795342



##########
doc/content/en/docs/++version++/Specification/_index.md:
##########
@@ -827,7 +827,31 @@ Here, as scale property is stored in value itself it needs 
more bytes than prece
 ### UUID
 The `uuid` logical type represents a random generated universally unique 
identifier (UUID).
 
-A `uuid` logical type annotates an Avro `string`. The string has to conform 
with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt)
+A `uuid` logical type annotates an Avro `string`, `fixed` of length 16 or 
`bytes`. The string has to conform with 
[RFC-4122](https://www.ietf.org/rfc/rfc4122.txt)
+
+The following schemas represent a uuid:
+```json
+{
+  "type": "string",
+  "logicalType": "uuid"
+}
+```
+
+```json
+{
+  "type": "fixed",
+  "size": "16",
+  "logicalType": "uuid"
+}
+```
+
+```json
+{
+  "type": "bytes",
+  "logicalType": "uuid"
+}
+```

Review Comment:
   I don't think this adds value since a UUID is always 128 bits. Storing it as 
bytes will be suboptimal since you first need to encode the length of the bytes 
which is always the same.



##########
lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java:
##########
@@ -296,8 +296,12 @@ private Uuid() {
     @Override
     public void validate(Schema schema) {
       super.validate(schema);
-      if (schema.getType() != Schema.Type.STRING) {
-        throw new IllegalArgumentException("Uuid can only be used with an 
underlying string type");
+      if (schema.getType() != Schema.Type.STRING && schema.getType() != 
Schema.Type.FIXED
+          && schema.getType() != Schema.Type.BYTES) {
+        throw new IllegalArgumentException("Uuid can only be used with an 
underlying string, fixed or byte type");
+      }
+      if (schema.getType() == Schema.Type.FIXED && schema.getFixedSize() != 2 
* Long.BYTES) {

Review Comment:
   Should we create a new constant with `UUID_BYTES`?



##########
doc/content/en/docs/++version++/Specification/_index.md:
##########
@@ -827,7 +827,31 @@ Here, as scale property is stored in value itself it needs 
more bytes than prece
 ### UUID
 The `uuid` logical type represents a random generated universally unique 
identifier (UUID).
 
-A `uuid` logical type annotates an Avro `string`. The string has to conform 
with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt)
+A `uuid` logical type annotates an Avro `string`, `fixed` of length 16 or 
`bytes`. The string has to conform with 
[RFC-4122](https://www.ietf.org/rfc/rfc4122.txt)
+
+The following schemas represent a uuid:
+```json
+{
+  "type": "string",
+  "logicalType": "uuid"
+}
+```
+
+```json
+{
+  "type": "fixed",
+  "size": "16",
+  "logicalType": "uuid"
+}
+```
+
+```json
+{
+  "type": "bytes",
+  "logicalType": "uuid"
+}
+```
+(UUID will be sorted differently if the underlying type is a string, a fixed 
or a bytes)

Review Comment:
   I would not expect this to be true, but I have to dig into the details.



##########
lang/java/avro/src/test/java/org/apache/avro/TestLogicalType.java:
##########
@@ -208,7 +208,7 @@ void uuidExtendsString() {
     assertEquals(LogicalTypes.uuid(), uuidSchema.getLogicalType());
 
     assertThrows("UUID requires a string", IllegalArgumentException.class,
-        "Uuid can only be used with an underlying string type",
+        "Uuid can only be used with an underlying string, fixed or byte type",

Review Comment:
   The current proposal does not include bytes.
   ```suggestion
           "Uuid can only be used with an underlying string or fixed type",
   ```



##########
lang/java/avro/src/main/java/org/apache/avro/Conversions.java:
##########
@@ -68,6 +80,68 @@ public UUID fromCharSequence(CharSequence value, Schema 
schema, LogicalType type
     public CharSequence toCharSequence(UUID value, Schema schema, LogicalType 
type) {
       return value.toString();
     }
+
+    @Override
+    public UUID fromFixed(final GenericFixed value, final Schema schema, final 
LogicalType type) {
+      long mostSigBits = 0;
+      long leastSigBits = 0;
+      byte[] bytes = value.bytes();
+      for (int i = 0; i < Long.BYTES; i++) {
+        mostSigBits |= ((long) (bytes[i] & 255)) << (Byte.SIZE * i);
+        leastSigBits |= ((long) (bytes[i + Long.BYTES] & 255)) << (Byte.SIZE * 
i);
+      }
+
+      return new UUID(this.convert(mostSigBits), this.convert(leastSigBits));
+    }
+
+    private long convert(long value) {
+      if (this.isBigEndian) {
+        return value;
+      } else {
+        return Long.reverseBytes(value);
+      }
+    }
+
+    @Override
+    public UUID fromBytes(final ByteBuffer value, final Schema schema, final 
LogicalType type) {
+      BinaryDecoder decoder = 
DecoderFactory.get().binaryDecoder(value.array(), null);
+      try {
+        final long mostSigBits = decoder.readLong();

Review Comment:
   In Iceberg we [first read it to a 
buffer](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/ValueReaders.java#L255),
 but I think your version here is more efficient. Let me do some testing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] AVRO-3918: add uuid with bytes and fixed [avro]

Reply via email to