alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1683802898


##########
README.md:
##########
@@ -285,6 +285,108 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary` for maximum extensibility and fast skipping by
+thrift parsers.
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. A sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing 0
+  thrift += "\x08\xFF\xFF\x01";  // long form field header for 32767: binary
+  append_uleb(ext.size(), &thrift);
+  thrift += ext;
+  thrift.push_back('\0');  // add the trailing 0 back (+= "\x00" would append nothing)
+  return thrift;
+}
+```
+
+To facilitate independence of extensions between organizations the last 3 bytes
+of an extension contain a magic number. The current reserved magic numbers are:
+
+| Magic | Organization |
+|-------|--------------|
+| `PAR` | Reserved for the future when an extension replaces `PAR1` |
+| `PER` | Reserved for the future when an extension replaces `PARE` |
+| `ASF` | Apache |
+| `AWS` | Amazon |
+| `CDH` | Cloudera |
+| `CRM` | Salesforce |
+| `DBR` | Databricks |
+| `EXP` | Apache/Experimental |

Review Comment:
   This was discussed in the Parquet sync on Jul 17th, and the magic number was switched to a UUID.
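
For context, here is a minimal sketch of how a reader might recognize its own extension by a trailing marker, as the hunk above describes for the 3-byte magic. `HasTrailer` and `StripTrailer` are hypothetical names, not part of any Parquet API. The marker length is left to the caller, so the same check would work for a 16-byte UUID, assuming the UUID also sits at the end of the payload (the comment above does not specify its position).

```c++
#include <string>

// Hypothetical helper (not a Parquet API): returns true if `ext` ends with
// `marker`. std::string carries an explicit length, so payloads containing
// embedded NUL bytes are compared correctly.
bool HasTrailer(const std::string& ext, const std::string& marker) {
  return ext.size() >= marker.size() &&
         ext.compare(ext.size() - marker.size(), marker.size(), marker) == 0;
}

// Hypothetical helper: returns the payload without its trailing marker, or
// an empty string if the marker is not present.
std::string StripTrailer(const std::string& ext, const std::string& marker) {
  if (!HasTrailer(ext, marker)) return std::string();
  return ext.substr(0, ext.size() - marker.size());
}
```

A reader that does not find its own marker would simply ignore the extension bytes, which matches the "ignorable extension point" intent stated in the hunk.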



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]