alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1706658756

##########
README.md:
##########
@@ -290,6 +290,46 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.

+### Thrift extensions
+Parquet Thrift IDL reserves field-id `32767` of every Thrift struct for extensions. The (Thrift) type of this field is always `binary`. These choices provide some desirable properties:
+
+* Existing readers will ignore these extensions without any modifications
+* Existing readers will ignore the extension bytes with little processing overhead
+* The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.
+* Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the thrift IDL).
+
+Because only one field-id is reserved the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers which MUST put enough unique state in their extension bytes for disambiguation. This can be relatively easily achieved by adding a [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) at the start or end of the extension bytes. The extension does not specify a disambiguation mechanism to allow more flexibility to implementers.
+
+Putting everything together in an example, if we would extend `FileMetaData` it would look like this on the wire.
+
+      N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field)
+        4 bytes | 08 FF FF 01 (long form header for 32767: binary)
+      1-5 bytes | ULEB128(M) encoded size of the extension
+        M bytes | extension bytes
+         1 byte | \0 (thrift stop field)
+
+The choice to reserve only one field-id has an additional (and frankly unintended) property. It creates scarcity in the extension space and disincentivizes vendors from keeping their extensions private. As a vendor having an extension means one cannot use it in tandem with other extensions from other vendors even if such extensions are publicly known. The easiest path of interoperability and ability to further experiment is to push an extension through standardization and continue experimenting with other ideas internally on top of the (now) standardized version.
+
+#### Appending extensions to thrift
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  thrift.pop_back();  // remove the stop field
+  thrift += "\x08";   // binary
+  AppendUleb(32767, &thrift);  // field-id
+  AppendUleb(ext.size(), &thrift);  // field isze

Review Comment:
   Done.
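
For readers following the thread, here is a self-contained sketch of the append step, since the quoted hunk is cut off mid-function. It is not the PR's final code. `AppendUleb` is written out here as a hypothetical helper, because the hunk does not show its definition, and the last two steps (appending the extension bytes and restoring the stop field) are inferred from the wire-layout table in the quoted README text. The header bytes follow that table (`08 FF FF 01`). The Thrift compact protocol specification describes long-form field ids as zigzag-encoded varints, so it may be worth checking these bytes against a real Thrift decoder before relying on them.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not shown in the quoted hunk): appends `value`
// to `out` as an unsigned LEB128 varint, 7 bits per byte, low bits first.
void AppendUleb(uint64_t value, std::string* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));  // more bytes follow
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));  // final byte, high bit clear
}

// Sketch of appending an extension to an already serialized Thrift struct,
// following the wire layout given in the quoted README text.
std::string AppendExtension(std::string thrift, const std::string& ext) {
  thrift.pop_back();                // remove the trailing stop field (\0)
  thrift.push_back('\x08');         // long-form header byte: type binary
  AppendUleb(32767, &thrift);       // field-id, emitted as FF FF 01 per the table
  AppendUleb(ext.size(), &thrift);  // size of the extension (M)
  thrift += ext;                    // M extension bytes (assumed from the table)
  thrift.push_back('\0');           // restore the stop field (assumed from the table)
  return thrift;
}
```

The input must be a complete compact-protocol struct that ends in its stop byte; the sketch does not validate that.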
