Fokko commented on code in PR #2652: URL: https://github.com/apache/avro/pull/2652#discussion_r1437795342
########## doc/content/en/docs/++version++/Specification/_index.md: ########## @@ -827,7 +827,31 @@ Here, as scale property is stored in value itself it needs more bytes than prece ### UUID The `uuid` logical type represents a random generated universally unique identifier (UUID). -A `uuid` logical type annotates an Avro `string`. The string has to conform with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt) +A `uuid` logical type annotates an Avro `string`, `fixed` of length 16 or `bytes`. The string has to conform with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt) + +The following schemas represent a uuid: +```json +{ + "type": "string", + "logicalType": "uuid" +} +``` + +```json +{ + "type": "fixed", + "size": "16", + "logicalType": "uuid" +} +``` + +```json +{ + "type": "bytes", + "logicalType": "uuid" +} +``` Review Comment: I don't think this adds value since a UUID is always 128 bits. Storing it as bytes will be suboptimal since you first need to encode the length of the bytes which is always the same. ########## lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java: ########## @@ -296,8 +296,12 @@ private Uuid() { @Override public void validate(Schema schema) { super.validate(schema); - if (schema.getType() != Schema.Type.STRING) { - throw new IllegalArgumentException("Uuid can only be used with an underlying string type"); + if (schema.getType() != Schema.Type.STRING && schema.getType() != Schema.Type.FIXED + && schema.getType() != Schema.Type.BYTES) { + throw new IllegalArgumentException("Uuid can only be used with an underlying string, fixed or byte type"); + } + if (schema.getType() == Schema.Type.FIXED && schema.getFixedSize() != 2 * Long.BYTES) { Review Comment: Should we create a new constant with `UUID_BYTES`? ########## doc/content/en/docs/++version++/Specification/_index.md: ########## @@ -827,7 +827,31 @@ Here, as scale property is stored in value itself it needs more bytes than prece ### UUID The `uuid` logical type represents a random generated universally unique identifier (UUID). -A `uuid` logical type annotates an Avro `string`. The string has to conform with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt) +A `uuid` logical type annotates an Avro `string`, `fixed` of length 16 or `bytes`. The string has to conform with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt) + +The following schemas represent a uuid: +```json +{ + "type": "string", + "logicalType": "uuid" +} +``` + +```json +{ + "type": "fixed", + "size": "16", + "logicalType": "uuid" +} +``` + +```json +{ + "type": "bytes", + "logicalType": "uuid" +} +``` +(UUID will be sorted differently if the underlying type is a string, a fixed or a bytes) Review Comment: I would not expect this to be true, but I have to dig into the details. ########## lang/java/avro/src/test/java/org/apache/avro/TestLogicalType.java: ########## @@ -208,7 +208,7 @@ void uuidExtendsString() { assertEquals(LogicalTypes.uuid(), uuidSchema.getLogicalType()); assertThrows("UUID requires a string", IllegalArgumentException.class, - "Uuid can only be used with an underlying string type", + "Uuid can only be used with an underlying string, fixed or byte type", Review Comment: The current proposal does not include bytes. ```suggestion "Uuid can only be used with an underlying string or fixed type", ``` ########## lang/java/avro/src/main/java/org/apache/avro/Conversions.java: ########## @@ -68,6 +80,68 @@ public UUID fromCharSequence(CharSequence value, Schema schema, LogicalType type public CharSequence toCharSequence(UUID value, Schema schema, LogicalType type) { return value.toString(); } + + @Override + public UUID fromFixed(final GenericFixed value, final Schema schema, final LogicalType type) { + long mostSigBits = 0; + long leastSigBits = 0; + byte[] bytes = value.bytes(); + for (int i = 0; i < Long.BYTES; i++) { + mostSigBits |= ((long) (bytes[i] & 255)) << (Byte.SIZE * i); + leastSigBits |= ((long) (bytes[i + Long.BYTES] & 255)) << (Byte.SIZE * i); + } + + return new UUID(this.convert(mostSigBits), this.convert(leastSigBits)); + } + + private long convert(long value) { + if (this.isBigEndian) { + return value; + } else { + return Long.reverseBytes(value); + } + } + + @Override + public UUID fromBytes(final ByteBuffer value, final Schema schema, final LogicalType type) { + BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value.array(), null); + try { + final long mostSigBits = decoder.readLong(); Review Comment: In Iceberg we [first read it to a buffer](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/ValueReaders.java#L255), but I think your version here is more efficient. Let me do some testing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
