etseidl commented on code in PR #9619:
URL: https://github.com/apache/arrow-rs/pull/9619#discussion_r2998234292


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1383,11 +1418,39 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, 
E> {
 }
 
 fn update_min<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T, min: 
&mut Option<T>) {
-    update_stat::<T, _>(descr, val, min, |cur| compare_greater(descr, cur, 
val))
+    if min.is_none() {

Review Comment:
   Changes here and to `update_max` also worry me. It's just complicated 
because we can't simply exclude NaN. In the case of all-NaN we have to properly 
order them so we get the min and max NaN, but if a non-NaN shows up we have to 
start over. Same thing in `get_min_max()`.
   
   This is why I prefer the other solution of simply using total order and 
dealing with the possibility of NaN in the statistics.



##########
parquet/src/column/writer/encoder.rs:
##########
@@ -325,24 +341,35 @@ impl<T: DataType> ColumnValueEncoder for 
ColumnValueEncoderImpl<T> {
     }
 }
 
+// Get min and max values for all values in `iter`.
+//
+// For floating point we need to compare NaN values until we encounter a 
non-NaN
+// value which then becomes the new min/max. After this, only non-NaN values 
are
+// evaluated. If all values are NaN, then the min/max NaNs as determined by
+// IEEE 754 total order are returned.

Review Comment:
   This has me a bit worried. I need to do some benchmarking to make sure all 
the complicated NaN logic isn't killing performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to