alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1640207092


##########
README.md:
##########
@@ -285,6 +285,108 @@ There are many places in the format for compatible 
extensions:
 - Encodings: Encodings are specified by enum and more can be added in the 
future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary` for maximum extensibility and fast skipping by
+thrift parsers.
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. Sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing 0
+  thrift += "\x08\xFE\xFF\x03";  // long form field header for 32767 (zigzag varint): binary
+  append_uleb(ext.size(), &thrift);
+  thrift += ext;
+  thrift.push_back('\0');  // add the trailing 0 back ("\x00" would append nothing)
+  return thrift;
+}
+```
+
+To facilitate independence of extensions between organizations the last 3 bytes
+of an extension contain a magic number. The current reserved magic numbers are:
+
+| Magic | Organization |
+|-------|--------------|
+| `PAR` | Reserved for the future when an extension replaces `PAR1` |
+| `PER` | Reserved for the future when an extension replaces `PARE` |
+| `ASF` | Apache |
+| `AWS` | Amazon |
+| `CDH` | Cloudera |
+| `CRM` | Salesforce |
+| `DBR` | Databricks |
+| `EXP` | Apache/Experimental |

Review Comment:
   It is exclusive because the extension mechanism covers: custom/internal 
extensions, experimental public extensions, and a transition period for new 
additions to the standard.
   
   I think an example will make this clearer.
   
   The way I see this playing out is:
   1. different vendors add their own extensions in a way that does not interfere 
with one another. By construction only one extension can be in a file, since a 
writer is owned by one vendor and will write that vendor's extension. The 
disambiguator string is here so that readers that know about *one* (or more) 
extensions can check if the extension that is present is one they can read. 
Some of these extensions will never be made public.
   2. at some point a vendor, say `NVD`, wants to make their extension public by 
including it in the standard. At this point there is a formal proposal to 
include it in the format as a first class part of metadata.
   3. the extension is reviewed and accepted, possibly in a form slightly 
different than proposed
   4. the accepted proposal takes magic `PAR` and is now public. For a 
transition period the format contains the new format as an extension so that 
readers have time to be upgraded. This is one of the reasons we pre-reserve 
`PAR` now.
   5. once the transition period is over, the extension verbatim becomes the 
new footer, but now without the thrift field header because it is no longer an 
extension. At this point we also say that the new magic for the parquet file in 
this version is `PAR\0`. This will require very few changes to readers that 
already supported (4).
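
   A rough sketch of the reader-side check described in (1), assuming the magic 
is simply the last 3 bytes of the extension payload as the README text says. 
`ExtensionMagic` and `CanReadExtension` are names made up for illustration and 
are not part of any Parquet API:

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: returns the 3-byte magic that ends an extension
// payload, or an empty string if the payload is too short to carry one.
std::string ExtensionMagic(const std::string& ext) {
  const std::size_t kMagicSize = 3;
  if (ext.size() < kMagicSize) return std::string();
  return ext.substr(ext.size() - kMagicSize);
}

// Hypothetical reader-side check: true only when the payload ends in a magic
// this reader knows how to parse. A reader would skip any other payload, the
// same way it skips an unknown Thrift field.
bool CanReadExtension(const std::string& ext,
                      const std::set<std::string>& known) {
  const std::string magic = ExtensionMagic(ext);
  return !magic.empty() && known.count(magic) > 0;
}
```

   A reader that understands `DBR` and `PAR` would pass `{"DBR", "PAR"}` as 
`known` and fall back to the regular footer whenever the check returns false.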



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

