alkis commented on code in PR #250:
URL: https://github.com/apache/parquet-format/pull/250#discussion_r1620467287


##########
README.md:
##########
@@ -118,6 +118,51 @@ chunks they are interested in.  The columns chunks should 
then be read sequentia
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+ ### PAR3 File Footers
+
+ PAR3 file footer footer format designed to better support wider-schemas and 
more control
+ over the various footer size vs compute trade-offs.  Its format is as follows:
+   - Serialized Thrift FileMetadata Structure
+   - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata.
+   - 4-byte length in bytes (little endian) of all preceding elements in the 
footer.

Review Comment:
   I recommend a crc32 of the length itself. This is to detect early that a 
footer is corrupt and avoid reading some epic amount of garbage from the end of 
the file. For example think of a bit flip in one of the top bits of the length, 
it will cause a reader to read 100s of MBs of the end of the file only to check 
that the crc doesn't match.



##########
src/main/thrift/parquet.thrift:
##########
@@ -1115,6 +1189,34 @@ union EncryptionAlgorithm {
   2: AesGcmCtrV1 AES_GCM_CTR_V1
 }
 
+/**
+ * Embedded metadata page.
+ * 
+ * A metadata page is a data page used to store metadata about
+ * the data stored in the file. This is a key feature of PAR3
+ * footers which allow for deferred decoding of metadata.
+ *
+ * For common use cases the current recommendation is to use a 
+ * an encoding that supported random access but implementations may choose
+ * other configuration parameters if necessary. Implementations
+ * SHOULD consider allowing configurability per page to allow for end-users
+ * to optimize size vs compute trade-offs that make sense for their use-case.
+ *
+ * Statistics for Metadata pages SHOULD NOT be written.
+ *
+ * Structs of this type should never be written in PAR1.
+ */
+struct MetadataPage {
+   // A serialized page including metadata thrift header and data.
+   1: required binary page
+   // Optional compression applied to the page.
+   2: optional CompressionCodec compression
+   // Number of elements stored.  This is duplicated here to help in 
+   // use-cases where knowing the total number of elements up front for
+   // computation would be useful.
+   3: num_values

Review Comment:
   missing type?



##########
README.md:
##########
@@ -118,6 +118,51 @@ chunks they are interested in.  The columns chunks should 
then be read sequentia
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+ ### PAR3 File Footers
+
+ PAR3 file footer footer format designed to better support wider-schemas and 
more control
+ over the various footer size vs compute trade-offs.  Its format is as follows:
+   - Serialized Thrift FileMetadata Structure
+   - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata.
+   - 4-byte length in bytes (little endian) of all preceding elements in the 
footer.
+   - 4-byte little-endian flag field to indicate features that require special 
parsing of the footer.
+     Readers MUST raise an error if there is an unrecognized flag.  Current 
flags:
+
+     * 0x01 - Footer encryption enabled (when set the encryption information 
is written before 
+        FileMeta structure as in the PAR1 footer).
+     * 0x02 - CRC32 of FileMetadata Footer.

Review Comment:
   I don't find this flag particularly elegant. There is already a magic 
number. We can employ the same mechanism as PAR1/PARE had to distinguish an 
encrypted footer.



##########
README.md:
##########
@@ -118,6 +118,51 @@ chunks they are interested in.  The columns chunks should 
then be read sequentia
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+ ### PAR3 File Footers
+
+ PAR3 file footer footer format designed to better support wider-schemas and 
more control
+ over the various footer size vs compute trade-offs.  Its format is as follows:
+   - Serialized Thrift FileMetadata Structure
+   - (Optional) 4 byte CRC32 of the serialized Thrift FileMetadata.

Review Comment:
   This should not be optional.
   
   All bytes are not equal in a file. In particular footer bytes are very 
important because if those are corrupt - we can't read any bytes of the file. 
If anything Footers not having a required checksum for their content is a 
design flaw of the original parquet specification.



##########
src/main/thrift/parquet.thrift:
##########
@@ -792,12 +806,23 @@ struct ColumnMetaData {
   6: required i64 total_uncompressed_size
 
   /** total byte size of all compressed, and potentially encrypted, pages 
-   *  in this column chunk (including the headers) **/
+   *  in this column chunk (including the headers) 
+   * 
+   *  Fetching the range of min(dictionary_page_offset, data_page_offset) 
+   *  + total_compressed_size should fetch all data in the the given column 
+   * chunk.
+   */
   7: required i64 total_compressed_size
 
-  /** Optional key/value metadata **/
+  /** Optional key/value metadata 
+    * PAR1: Optional. 
+    * PAR3: Don't write use key_value_metadata instead.

Review Comment:
   typo?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to