alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1706658756


##########
README.md:
##########
@@ -290,6 +290,46 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Parquet Thrift IDL reserves field-id `32767` of every Thrift struct for extensions. The (Thrift) type of this field is always `binary`. These choices provide some desirable properties:
+
+* Existing readers will ignore these extensions without any modifications.
+* Existing readers will ignore the extension bytes with little processing overhead.
+* The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.
+* Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the Thrift IDL).
+
+Because only one field-id is reserved, the extension bytes themselves require disambiguation; otherwise readers cannot decode extensions safely. This is left to implementers, who MUST put enough unique state in their extension bytes for disambiguation. This can be achieved relatively easily by adding a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) at the start or end of the extension bytes. The extension mechanism deliberately does not specify how to disambiguate, which leaves implementers more flexibility.
+
+Putting everything together in an example, if we were to extend `FileMetaData`, it would look like this on the wire:
+
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
+    4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
+    1-5 bytes | ULEB128(M) encoded size of the extension
+    M bytes   | extension bytes
+    1 byte    | \0 (thrift stop field)
+
+The choice to reserve only one field-id has an additional (and frankly unintended) property. It creates scarcity in the extension space and disincentivizes vendors from keeping their extensions private. A vendor that uses its own extension cannot use it in tandem with extensions from other vendors, even if those extensions are publicly known. The easiest path to interoperability, and to further experimentation, is to push an extension through standardization and continue experimenting with other ideas internally on top of the (now) standardized version.
+
+#### Appending extensions to thrift
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  thrift.pop_back();                // remove the stop field
+  thrift += "\x08";                 // binary
+  AppendUleb(32767, &thrift);       // field-id
+  AppendUleb(ext.size(), &thrift);  // field size

Review Comment:
   Done.
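
For reference, the `AppendExtension` snippet quoted above is cut off by the diff hunk; it could be completed roughly as follows. This is a minimal sketch, not the text of the PR: the body of `AppendUleb` is an assumed helper, and the last three statements (appending the extension bytes, restoring the stop field, returning the result) are inferred from the wire layout in the README hunk. The sketch encodes the field-id exactly as the quoted snippet does (plain ULEB128, giving `08 FF FF 01`). The Thrift compact protocol specification zigzag-encodes field ids, which would give `08 FE FF 03` for 32767, so that detail is worth checking against a real Thrift reader.

```cpp
#include <cstdint>
#include <string>

// Assumed helper (not shown in the quoted diff): appends v to *out as an
// unsigned LEB128 varint, 7 bits per byte, least significant group first.
void AppendUleb(uint64_t v, std::string* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));  // more bytes follow
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));  // final byte, high bit clear
}

// Precondition (assumed): `thrift` is a non-empty compact-protocol struct
// whose last byte is the \0 stop field.
std::string AppendExtension(std::string thrift, const std::string& ext) {
  thrift.pop_back();                // remove the stop field
  thrift += "\x08";                 // long-form field header, type binary
  AppendUleb(32767, &thrift);       // field-id, encoded as in the quoted snippet
  AppendUleb(ext.size(), &thrift);  // field size
  thrift += ext;                    // extension bytes (inferred from the layout)
  thrift.push_back('\0');           // put the stop field back (inferred)
  return thrift;
}
```

Because the function only touches the trailing stop byte, it never parses the serialized struct, which matches the "without requiring Thrift libraries" claim in the README text.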



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
