alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1623189082
##########
README.md:
##########
@@ -285,6 +285,61 @@ There are many places in the format for compatible extensions:
- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary`.
+
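+For reference, with the Thrift compact protocol (which Parquet uses for its
+footer metadata) the appended field is laid out roughly as follows, for an
+extension payload of `N` bytes:
+
+```
+0x08            long-form field header: field type `binary`
+0xFE 0xFF 0x03  field-id 32767, zigzag-encoded as a varint
+varint N        length of the extension data
+N bytes         the extension data
+0x00            stop field of the enclosing struct (follows the extension)
+```
+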
+Such extensions can easily be appended to an existing Thrift-serialized message
+without any special APIs. A sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  // Appends `x` as an unsigned LEB128 varint, as used by the compact protocol.
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing stop field (0x00)
+  // Long-form field header: type `binary` (0x08), then field-id 32767
+  // zigzag-encoded as a varint: zigzag(32767) = 65534 = FE FF 03.
+  thrift += "\x08\xFE\xFF\x03";
+  append_uleb(ext.size(), &thrift);  // length prefix of the binary field
+  thrift += ext;                     // the extension payload itself
+  thrift.push_back('\0');            // add the trailing stop field back
+  return thrift;
+}
+```
+
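+For illustration, the helper can be exercised on a minimal serialized struct;
+a real caller would pass the actual `FileMetaData` bytes instead:
+
+```c++
+std::string ExampleUsage() {
+  // An "empty" compact-protocol struct is just the stop field (0x00); it
+  // stands in here for a real serialized FileMetaData.
+  std::string footer("\x00", 1);
+  std::string ext = "vendor-specific payload";
+  footer = AppendExtension(footer, ext);
+  // `footer` now carries the binary extension at field-id 32767 followed by
+  // the stop field; readers that do not know the field simply skip it.
+  return footer;
+}
+```
+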
+Additionally, the binary extension MUST have a specific form so that it is
+unambiguously identifiable by parsers that know of it and, as a corollary,
+cannot be accidentally generated by user data.
+
+```
+N bytes: the extension data
Review Comment:
In practice these extensions will start with some ID because vendors will
experiment and need to disambiguate internally. But I do not want to specify
this since it won't make it into the official spec.
For example, Databricks experiments with footers. It uses the above mechanism
and rolls it out to several customers. The first byte contains some ID for the
blob:
1. some custom encoding to speed up column-chunk offset parsing only, with the
idea that this is enough to pipeline data I/O with metadata parsing
2. some more refinement to the encoding
3. another experiment, now with more structured data, say flatbuffers
4. refinement of (3)
5. etc
6. final version proposed for inclusion in the spec
At this point (6) does not have to end up in the official spec as-is. The way
it will be added is either through changing the Parquet magic, or through
adding a new binary field in `FileMetaData` that contains the final version of
the encoding.
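
A rough sketch of the kind of convention I mean, with entirely made-up names
(none of this is part of the proposal):

```c++
#include <string>

// The first byte of the extension identifies which experimental blob follows;
// the ids below mirror the hypothetical sequence of experiments above.
enum VendorBlobId : char {
  kChunkOffsetsV1 = 1,      // custom encoding of column-chunk offsets
  kChunkOffsetsV2 = 2,      // refinement of that encoding
  kFlatbufferFooterV1 = 3,  // more structured experiment, e.g. flatbuffers
};

std::string MakeVendorExtension(VendorBlobId id, const std::string& payload) {
  std::string ext;
  ext.push_back(id);  // disambiguates the vendor's experiments internally
  ext += payload;
  return ext;         // then attached to the footer via AppendExtension
}
```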