Re: [I] [EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs [datafusion]

via GitHub Wed, 03 Jul 2024 15:49:16 -0700


efredine commented on issue #10922:
URL: https://github.com/apache/datafusion/issues/10922#issuecomment-2207434579


   Further to the performance discussion @alamb - the StringBuilder pattern you 
suggested in 
https://github.com/apache/datafusion/pull/11136#discussion_r1657725214 does 
seem to materially improve performance:
   
   ```
   Extract data page statistics for String/extract_statistics/String
                           time:   [15.368 µs 15.405 µs 15.446 µs]
                           change: [-68.672% -68.540% -68.409%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   ```
   So it seems like a worthwhile change to go ahead with. I think there are 
several places where we can do something similar.
   
   One question: I notice in that ticket that you appended nulls for missing 
values. However, I think in most cases missing values are simply omitted, 
because all the None values are removed by flattening. So, in general, users of 
the data page statistics will need to check whether the length of the array 
matches the number of actual data pages. This is different from how the row 
group statistics are handled: they instead have a null value for any missing 
statistics.
   
   Is this difference in behaviour expected, or just a side effect of the 
implementation?
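   
   To make the question concrete, here is a minimal sketch of the two 
behaviours. It uses plain `Vec<Option<&str>>` from the standard library as a 
stand-in for per-page statistics, not the actual Arrow builders, and both 
function names are hypothetical rather than taken from the DataFusion code:
   
   ```rust
   /// Hypothetical model of the "flatten" behaviour: pages whose statistic is
   /// `None` are dropped, so the result can be shorter than the page count.
   fn flattened_mins(pages: &[Option<&str>]) -> Vec<String> {
       pages.iter().flatten().map(|s| s.to_string()).collect()
   }

   /// Hypothetical model of the "append null" behaviour: one slot per page,
   /// with a missing statistic kept as `None` (a null in Arrow terms).
   fn null_preserving_mins(pages: &[Option<&str>]) -> Vec<Option<String>> {
       pages
           .iter()
           .copied()
           .map(|s| s.map(|v| v.to_owned()))
           .collect()
   }
   ```
   
   With three pages where the middle one has no statistic, the first function 
returns two values and the second returns three, so only the second lets a 
caller line results up with pages by index.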


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
