alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1640207092
##########
README.md:
##########
@@ -285,6 +285,108 @@ There are many places in the format for compatible
extensions:
- Encodings: Encodings are specified by enum and more can be added in the
future.
- Page types: Additional page types can be added and safely skipped.
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary` for maximum extensibility and fast skipping by
+thrift parsers.
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. Sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+ auto append_uleb = [](uint32_t x, std::string* out) {
+ while (true) {
+ int c = x & 0x7F;
+ if ((x >>= 7) == 0) {
+ out->push_back(c);
+ return;
+ } else {
+ out->push_back(c | 0x80);
+ }
+ }
+ };
+ thrift.pop_back(); // remove the trailing 0
+ thrift += "\x08\xFF\xFF\x01"; // long form field header for 32767: binary
+ append_uleb(ext.size(), &thrift);
+ thrift += ext;
+ thrift += "\x00"; // add the trailing 0 back
+ return thrift;
+}
+```
+
+To facilitate independence of extensions between organizations the last 3 bytes
+of an extension contain a magic number. The current reserved magic numbers are:
+
+| Magic | Organization |
+|-------|--------------|
+| `PAR` | Reserved for the future when an extension replaces `PAR1` |
+| `PER` | Reserved for the future when an extension replaces `PARE` |
+| `ASF` | Apache |
+| `AWS` | Amazon |
+| `CDH` | Cloudera |
+| `CRM` | Salesforce |
+| `DBR` | Databricks |
+| `EXP` | Apache/Experimental |
Review Comment:
It is exclusive because the extension mechanism covers: custom/internal
extensions, experimental public extensions, and a transition period for new
additions to the standard.
I think an example will make this clearer.
The way I see this playing out is:
1. different vendors add their own extensions in a way that does not interfere
with one another. By construction only one extension can be in a file, since a
writer is owned by one vendor and will write that vendor's extension. The
disambiguator string is here so that readers that know about *one* (or more)
extensions can check if the extension that is present is one they can read.
Some of these extensions will never be made public.
2. at some point a vendor, say `NVD`, wants to make their extension public by
including it in the standard. At this point there is a formal proposal to
include it in the format as a first class part of metadata.
3. the extension is reviewed and accepted, possibly in a form slightly
different than proposed
4. the accepted proposal takes magic `PAR` and is now public. For a
transition period the format contains the new format as an extension so that
readers have time to be upgraded. This is one of the reasons we pre-reserve
`PAR` now.
5. once the transition period is over, the extension verbatim becomes the
new footer, but now without the thrift field header because it is no longer an
extension. At this point we also say that the new magic for the parquet file in
this version is `PAR\0`. This will require very few changes to readers that
already supported (4).
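
To make step 1 concrete, here is a minimal sketch of the reader-side check. It assumes only what the README text above says: the last 3 bytes of an extension are its magic. `ExtensionMagic` and `CanReadExtension` are hypothetical names used for illustration and are not part of any Parquet library.

```c++
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: returns the trailing 3-byte magic of an extension
// payload, or nullopt if the payload is too short to carry one.
std::optional<std::string_view> ExtensionMagic(std::string_view ext) {
  constexpr std::size_t kMagicLen = 3;
  if (ext.size() < kMagicLen) return std::nullopt;
  return ext.substr(ext.size() - kMagicLen);
}

// Hypothetical reader-side check: true only if the magic of the extension
// found in the file is one this reader knows how to decode. A reader that
// gets `false` simply ignores the extension and uses the regular footer.
bool CanReadExtension(std::string_view ext,
                      const std::vector<std::string_view>& known_magics) {
  const auto magic = ExtensionMagic(ext);
  if (!magic) return false;
  return std::find(known_magics.begin(), known_magics.end(), *magic) !=
         known_magics.end();
}
```

A reader that supports both its own vendor extension and the transition-period `PAR` extension from step 4 would pass both magics in `known_magics`.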
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]