alkis commented on code in PR #254:
URL: https://github.com/apache/parquet-format/pull/254#discussion_r1628333100


##########
README.md:
##########
@@ -285,6 +285,61 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary`. 
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. Sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing 0
+  thrift += "\x08\xFF\xFF\x01";
+  append_uleb(ext.size(), &thrift);
+  thrift += ext;
+  thrift += "\x00";  // add the trailing 0 back
+  return thrift;
+}
+```
+
+Additionally the binary extension MUST have a specific form in order to be
+unambiguously identifiable by parsers that know of it and, as a corollary,
+impossible to be accidentally generated by user data.
+
+```
+N bytes: the extension data

Review Comment:
   I am still tempted to leave this out: I find it of little utility and an 
unnecessary constraint. What if the designer does use an id or version, but 
it is better placed at the end of the binary blob, before the crc of the 
length? Why force it to the start?
   
   I am adding a paragraph that strongly suggests implementors add an `id` to 
their extension, to make interoperability easier should it be needed, but I 
leave it up to them to place it where they deem most appropriate.



##########
README.md:
##########
@@ -285,6 +285,68 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary` for maximum extensibility and fast skipping by
+thrift parsers.
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. Sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing 0
+  thrift += "\x08\xFF\xFF\x01";
+  append_uleb(ext.size(), &thrift);
+  thrift += ext;
+  thrift += "\x00";  // add the trailing 0 back
+  return thrift;
+}
+```
+
+Additionally the binary extension MUST have a specific form in order to be
+unambiguously identifiable by parsers that know of it and, as a corollary,
+impossible to be accidentally generated by user data.
+
+```
+N bytes: the extension data
+4 bytes: little endian crc32 of the previous N bytes
+4 bytes: N in little endian 
+4 bytes: little endian crc32 of N
+3 bytes: 3 byte magic extension (after this we have the Thrift stop-field)
+```
+
+The choice of the 3 byte magic is so that the magic plus the `\x00` for the
+stop-field will form a new 4 byte magic which can replace `PAR1` or `PARE` in
+the future when all engines adopt a new format.
+
+Each organization/engine can reserve a magic extension here to avoid clashes.
+To add your extension, file a JIRA and send a PR.
+
+The current list of extensions are
+
+| Magic | Organization |
+|-------|--------------|
+| `PAR` | Reserved for the future when an extension replaces `PAR1` |
+| `PER` | Reserved for the future when an extension replaces `PARE` |
+| `DBR` | Databricks |
+| `CDH` | Cloudera |
+| `ASF` | Apache |

Review Comment:
   Done.
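
For reference, the trailer layout in the hunk above can be exercised with a small round-trip. This is only a sketch under assumptions the hunk does not settle: `Crc32` uses the common zlib/IEEE variant, "crc32 of N" is taken over the four little-endian bytes of N, and `WrapExtension` / `UnwrapExtension` are hypothetical helper names (C++17 for `std::optional`).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (zlib variant).
// Assumption: the hunk only says "crc32" and does not name a variant.
uint32_t Crc32(const std::string& s) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char b : s) {
    crc ^= b;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Serializes x as 4 little-endian bytes.
std::string Le32(uint32_t x) {
  std::string out(4, '\0');
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<char>((x >> (8 * i)) & 0xFF);
  }
  return out;
}

// Reads 4 little-endian bytes starting at pos.
uint32_t ReadLe32(const std::string& s, size_t pos) {
  uint32_t x = 0;
  for (int i = 0; i < 4; ++i) {
    x |= static_cast<uint32_t>(static_cast<unsigned char>(s[pos + i])) << (8 * i);
  }
  return x;
}

// Hypothetical helper: data | crc32(data) | N | crc32(N) | 3-byte magic.
std::string WrapExtension(const std::string& data, const std::string& magic3) {
  const std::string n = Le32(static_cast<uint32_t>(data.size()));
  return data + Le32(Crc32(data)) + n + Le32(Crc32(n)) + magic3;
}

// Hypothetical helper: validates the trailer from the end and returns the
// extension data, or nullopt if any check fails.
std::optional<std::string> UnwrapExtension(const std::string& blob,
                                           const std::string& magic3) {
  constexpr size_t kTrailer = 4 + 4 + 4 + 3;
  if (magic3.size() != 3 || blob.size() < kTrailer) return std::nullopt;
  const size_t end = blob.size();
  if (blob.compare(end - 3, 3, magic3) != 0) return std::nullopt;
  const std::string n_bytes = blob.substr(end - 11, 4);
  if (ReadLe32(blob, end - 7) != Crc32(n_bytes)) return std::nullopt;
  const uint32_t n = ReadLe32(blob, end - 11);
  if (n != blob.size() - kTrailer) return std::nullopt;
  std::string data = blob.substr(0, n);
  if (ReadLe32(blob, end - 15) != Crc32(data)) return std::nullopt;
  return data;
}
```

Checking the magic first, then the checksum of the length, then the length itself, and only then the data checksum lets a reader reject a foreign blob before it trusts N.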



##########
README.md:
##########
@@ -285,6 +285,68 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary` for maximum extensibility and fast skipping by
+thrift parsers.
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. Sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing 0
+  thrift += "\x08\xFF\xFF\x01";

Review Comment:
   Done.



##########
README.md:
##########
@@ -285,6 +285,61 @@ There are many places in the format for compatible extensions:
 - Encodings: Encodings are specified by enum and more can be added in the future.
 - Page types: Additional page types can be added and safely skipped.
 
+### Thrift extensions
+Thrift is used for metadata. The Thrift spec mandates that unknown fields are
+skipped. To facilitate extensions Parquet reserves field-id `32767` of *every*
+struct as an ignorable extension point. More specifically Parquet guarantees
+that field-id `32767` will *never* be seen in the official Thrift IDL. The type
+of this field is always `binary`. 
+
+Such extensions can easily be appended to an existing Thrift serialized message
+without any special APIs. Sample `C++` implementation is provided:
+
+```c++
+std::string AppendExtension(std::string thrift, std::string ext) {
+  auto append_uleb = [](uint32_t x, std::string* out) {
+    while (true) {
+      int c = x & 0x7F;
+      if ((x >>= 7) == 0) {
+        out->push_back(c);
+        return;
+      } else {
+        out->push_back(c | 0x80);
+      }
+    }
+  };
+  thrift.pop_back();  // remove the trailing 0
+  thrift += "\x08\xFF\xFF\x01";
+  append_uleb(ext.size(), &thrift);
+  thrift += ext;
+  thrift += "\x00";  // add the trailing 0 back
+  return thrift;
+}
+```
+
+Additionally the binary extension MUST have a specific form in order to be
+unambiguously identifiable by parsers that know of it and, as a corollary,
+impossible to be accidentally generated by user data.
+
+```
+N bytes: the extension data
+4 bytes: little endian crc32 of the previous N bytes
+4 bytes: N in little endian 
+4 bytes: little endian crc32 of N
+3 bytes: 3 byte magic extension (after this we have the Thrift stop-field)
+```
+
+The choice of the 3 byte magic is so that the magic plus the `\x00` for the
+stop-field will form a new 4 byte magic which can replace `PAR1` or `PARE` in
+the future when all engines adopt a new format.
+
+Each organization/engine can reserve a magic extension here to avoid clashes.
+
+The current list of extensions are:
+- `PAR`: Reserved for the future when an extension replaces `PAR1`
+- `PER`: Reserved for the future when an extension replaces `PARE`
+- `DBR`: Databricks

Review Comment:
   Done.
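
One detail in the sample is worth checking before copying it: `std::string::operator+=` taking a `const char*` stops at the first NUL, so `thrift += "\x00"` appends nothing and the stop-field is not restored. Below is a minimal sketch of a variant that avoids this. `AppendUleb` is a hypothetical free-function version of the lambda, the input check is an added assumption, and the field-header bytes are copied unchanged from the hunk without being verified here.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Appends x as an unsigned LEB128 varint: 7 bits per byte, low bits first,
// with the high bit set on every byte except the last.
void AppendUleb(uint32_t x, std::string* out) {
  while (x >= 0x80) {
    out->push_back(static_cast<char>((x & 0x7F) | 0x80));
    x >>= 7;
  }
  out->push_back(static_cast<char>(x));
}

std::string AppendExtension(std::string thrift, const std::string& ext) {
  // Added assumption: refuse input that does not end in a stop-field.
  if (thrift.empty() || thrift.back() != '\0') {
    throw std::invalid_argument("expected a struct ending in a stop-field");
  }
  thrift.pop_back();              // remove the trailing stop-field
  thrift += "\x08\xFF\xFF\x01";   // header bytes as given in the hunk
  AppendUleb(static_cast<uint32_t>(ext.size()), &thrift);
  thrift += ext;
  thrift.push_back('\0');         // push_back, since += "\x00" adds no byte
  return thrift;
}
```

Taking `ext` by const reference also avoids one copy compared with the by-value signature in the hunk.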



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

