emkornfield commented on code in PR #250:
URL: https://github.com/apache/parquet-format/pull/250#discussion_r1621080671


##########
src/main/thrift/parquet.thrift:
##########
@@ -812,8 +837,20 @@ struct ColumnMetaData {
 
   /** Set of all encodings used for pages in this column chunk.
    * This information can be used to determine if all data pages are
-   * dictionary encoded for example **/
+   * dictionary encoded for example  
+   *
+   *  PAR1: Optional. May be deprecated in a future release in favor
+   *        serialized_encoding_stats.
+   *  PAR3: Don't populate.  Write serialized_page_encoding_stats.
+   **/
   13: optional list<PageEncodingStats> encoding_stats;
+  /** 
+    * Serialized page encoding stats.
+    *
+    * PAR1: Start populating after encoding_stats is deprecated.
+    * PAR3: Populate instead of encoding_stats.
+    */
+  17: optional binary serialized_encoding_stats

Review Comment:
   Yes I think it serves three purposes:
   1.  consolidate encodings.
   2. Determine optimizations available in the engine (e.g. fully dictionary 
encoded, fully RLE, etc)
   3. Instrumentation for engines to understand what the distribution of data 
they see to prioritize optimizations for specific encodings.
   
   I'm OK removing completely but my concern around #2 is spec changes take a 
while to propagate, having the information for engines to determine what 
optimizations they want to do without requiring a new flag I think is useful.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to