alamb commented on code in PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#discussion_r1628143605


##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -925,6 +932,71 @@ fn make_dict_batch() -> RecordBatch {
     .unwrap()
 }
 
+fn make_interval_batch(offset: i32) -> RecordBatch {
+    let schema = Schema::new(vec![
+        Field::new(
+            "year_month",
+            DataType::Interval(IntervalUnit::YearMonth),
+            true,
+        ),
+        Field::new("day_time", DataType::Interval(IntervalUnit::DayTime), 
true),
+        Field::new(
+            "month_day_nano",
+            DataType::Interval(IntervalUnit::MonthDayNano),
+            true,
+        ),
+    ]);
+    let schema = Arc::new(schema);
+
+    let ym_arr = IntervalYearMonthArray::from(vec![
+        Some(IntervalYearMonthType::make_value(1 + offset, 1 + offset)),

Review Comment:
   In general I suggest changing this so the values of the two fields are different (so that the test would catch bugs where the fields weren't properly interpreted).
   
   For example, instead of
   
   ```rust
           Some(IntervalYearMonthType::make_value(1 + offset, 1 + offset)),
   ```
   
   Something like this (use `10 + offset` in the second field so the values are different):
   
   ```rust
           Some(IntervalYearMonthType::make_value(1 + offset, 10 + offset)),
   ```
   
   The same applies to the rest of the values in this code
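   
   For the other two columns, a rough sketch along the same lines might look like this (the specific values and variable names here are just placeholders for illustration; any distinct per-field values would do):
   
   ```rust
       let dt_arr = IntervalDayTimeArray::from(vec![
           // days and milliseconds differ, so swapped fields would be caught
           Some(IntervalDayTimeType::make_value(1 + offset, 10 + offset)),
       ]);
       let mdn_arr = IntervalMonthDayNanoArray::from(vec![
           // months, days and nanoseconds all differ
           Some(IntervalMonthDayNanoType::make_value(
               1 + offset,
               10 + offset,
               (100 + offset) as i64,
           )),
       ]);
   ```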



##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -256,6 +259,13 @@ macro_rules! get_statistic {
                     Some(DataType::Float16) => {
                         Some(ScalarValue::Float16(from_bytes_to_f16(s.$bytes_func())))
                     }
+                    Some(DataType::Interval(unit)) => {
+                        match unit {
+                            IntervalUnit::YearMonth => unimplemented!("Interval statistics not yet supported by parquet"),

Review Comment:
   In general, in Rust `unimplemented!()` results in a panic, which is not a great user experience.
   
   I think this code purposely ignores errors (in order to gracefully handle parquet files that might not have the expected statistics, or that were created by some other writer).
   
   Thus, I suggest changing these cases from a panic to returning `None` (and then adjusting the test appropriately).
   
   If we return `None`, then once interval statistics are properly stored by the parquet-rs writer, the test will fail on the next upgrade and we can update it with the correct values.
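   
   For example, something like this (only a sketch; the exact shape of the arm depends on the surrounding `get_statistic!` macro, but the idea is that all interval units report no statistics):
   
   ```rust
                   Some(DataType::Interval(_)) => {
                       // Parquet statistics are not written for interval columns yet,
                       // so report "no statistics" instead of panicking
                       None
                   }
   ```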


