wgtmac commented on PR #34112:
URL: https://github.com/apache/arrow/pull/34112#issuecomment-1433996973

   > > It seems like we do have handling for these two cases.
   > 
   > Just for clarity, the PR I linked (and my thoughts) were about how we 
currently handle row group statistics. I'm not sure if the rules are identical 
for page statistics. I mainly wanted to make sure my understanding of the row 
group statistics wasn't invalid.
   
   IIUC, row group statistics are aggregated from page statistics, so they 
should share the same rules. The parquet thrift message definition does allow 
only one of min or max to be present:
   ```thrift
   /**
    * Statistics per row group and per page
    * All fields are optional.
    */
   struct Statistics {
      /**
       * DEPRECATED: min and max value of the column. Use min_value and max_value.
       *
       * Values are encoded using PLAIN encoding, except that variable-length byte
       * arrays do not include a length prefix.
       *
       * These fields encode min and max values determined by signed comparison
       * only. New files should use the correct order for a column's logical type
       * and store the values in the min_value and max_value fields.
       *
       * To support older readers, these may be set when the column order is
       * signed.
       */
      1: optional binary max;
      2: optional binary min;
      /** count of null value in the column */
      3: optional i64 null_count;
      /** count of distinct values occurring */
      4: optional i64 distinct_count;
      /**
       * Min and max values for the column, determined by its ColumnOrder.
       *
       * Values are encoded using PLAIN encoding, except that variable-length byte
       * arrays do not include a length prefix.
       */
      5: optional binary max_value;
      6: optional binary min_value;
   }
   ```
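
   To make the "either side may be missing" point concrete, here is a minimal Python sketch of folding page statistics into row group statistics. The `Statistics` dataclass and `merge` function are hypothetical names for illustration, not Arrow or parquet-cpp APIs. It assumes that plain byte-wise comparison is a valid ordering for the column (not true for every ColumnOrder) and that a bound is dropped as soon as any page lacks it; a real writer may treat all-null pages differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    """Hypothetical mirror of the thrift Statistics struct (all fields optional)."""
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None
    null_count: Optional[int] = None


def merge(pages):
    """Fold page statistics into a single row group Statistics.

    Each field is aggregated independently, so a page without a min drops
    the aggregated min but leaves the aggregated max untouched.
    Assumption: byte-wise comparison matches the column's ordering.
    """
    pages = list(pages)
    mins = [p.min_value for p in pages]
    maxs = [p.max_value for p in pages]
    nulls = [p.null_count for p in pages]
    return Statistics(
        min_value=min(mins) if pages and None not in mins else None,
        max_value=max(maxs) if pages and None not in maxs else None,
        null_count=sum(nulls) if pages and None not in nulls else None,
    )
```

   With this rule, a reader has to check `min_value` and `max_value` for `None` separately rather than assuming they come as a pair.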
   
   On the other hand, the page index is a different story. The column index 
definition does require both min and max values to be present for any page 
that is not a null page:
   ```thrift
   /**
    * Description for ColumnIndex.
    * Each <array-field>[i] refers to the page at OffsetIndex.page_locations[i]
    */
   struct ColumnIndex {
     /**
      * A list of Boolean values to determine the validity of the corresponding
      * min and max values. If true, a page contains only null values, and writers
      * have to set the corresponding entries in min_values and max_values to
      * byte[0], so that all lists have the same length. If false, the
      * corresponding entries in min_values and max_values must be valid.
      */
     1: required list<bool> null_pages
   
     /**
      * Two lists containing lower and upper bounds for the values of each page
      * determined by the ColumnOrder of the column. These may be the actual
      * minimum and maximum values found on a page, but can also be (more compact)
      * values that do not exist on a page. For example, instead of storing "Blart
      * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C".
      * Such more compact values must still be valid values within the column's
      * logical type. Readers must make sure that list entries are populated before
      * using them by inspecting null_pages.
      */
     2: required list<binary> min_values
     3: required list<binary> max_values
   
     /**
      * Stores whether both min_values and max_values are ordered and if so, in
      * which direction. This allows readers to perform binary searches in both
      * lists. Readers cannot assume that max_values[i] <= min_values[i+1], even
      * if the lists are ordered.
      */
     4: required BoundaryOrder boundary_order
   
     /** A list containing the number of null values for each page **/
     5: optional list<i64> null_counts
   }
   ```
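
   By contrast, a reader of the column index can rely on min and max coming as a pair for every non-null page. The following Python sketch shows the check that the `null_pages` comment asks readers to perform. `ColumnIndex` and `page_bounds` are hypothetical names for illustration, not an existing API, and only the three required list fields discussed above are modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnIndex:
    """Hypothetical mirror of three required thrift ColumnIndex fields."""
    null_pages: List[bool]
    min_values: List[bytes]
    max_values: List[bytes]


def page_bounds(index):
    """Return one (min, max) tuple per page, or None for a null page."""
    n = len(index.null_pages)
    if len(index.min_values) != n or len(index.max_values) != n:
        raise ValueError(
            "null_pages, min_values and max_values must have equal length"
        )
    bounds = []
    for is_null, lo, hi in zip(
        index.null_pages, index.min_values, index.max_values
    ):
        # For a null page the entries are byte[0] placeholders; ignore them.
        bounds.append(None if is_null else (lo, hi))
    return bounds
```

   Unlike the `Statistics` case, there is no per-side optionality here: a page either has both bounds or is flagged in `null_pages`.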
   
   So I am fine with accepting page/row group statistics in which only one of 
min or max is present. @westonpace @wjones127 

