[GitHub] [arrow-rs] alamb commented on pull request #512: Pre-compute parquet stats in arrow writer

GitBox Wed, 30 Jun 2021 13:49:32 -0700


alamb commented on pull request #512:
URL: https://github.com/apache/arrow-rs/pull/512#issuecomment-871717610



   > May you please check if this would be useful. I've left the distinct count 
as None as we'd need an arrow::compute kernel that does a distinct count.
   
   Thanks for this PR @nevi-me ! 
   
   In IOx we often would already have the `min`, `max`, `null_count` (and 
sometimes `distinct_count`) values for data we are saving to parquet, so being 
able to supply them somehow to the writer would be great. 
   
   If using the arrow compute kernels to compute the statistics is faster than 
doing it row by row, that seems like a win too from my perspective.
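
For illustration only, the statistics in question can be sketched in plain Rust with no arrow or parquet dependency. `ColumnStats` and `compute_stats` are hypothetical names invented for this sketch and are not part of either crate; a column is modelled as a slice of `Option<i64>`, where `None` stands for a null. This is the simple row-by-row form, not the kernel-based approach in the PR.

```rust
/// Hypothetical container for column statistics (not a parquet or arrow type).
#[derive(Debug, PartialEq)]
pub struct ColumnStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: u64,
    /// Left as `None`, mirroring the PR, which does not compute it.
    pub distinct_count: Option<u64>,
}

/// Compute min, max and null count in a single pass over the values.
pub fn compute_stats(values: &[Option<i64>]) -> ColumnStats {
    let mut stats = ColumnStats {
        min: None,
        max: None,
        null_count: 0,
        distinct_count: None,
    };
    for v in values {
        match v {
            Some(x) => {
                // First non-null value seeds min/max; later ones narrow them.
                stats.min = Some(stats.min.map_or(*x, |m| m.min(*x)));
                stats.max = Some(stats.max.map_or(*x, |m| m.max(*x)));
            }
            None => stats.null_count += 1,
        }
    }
    stats
}
```

A caller that already tracks these values could build a `ColumnStats` directly instead of calling `compute_stats`, which is the "supply them to the writer" case described above.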
   
   
   > @Dandandan @jorgecarleitao I'd expect such to already exist in datafusion, 
so would simply porting it to arrow::compute work?
   
   DataFusion computes distinct counts using the code in 
https://github.com/apache/arrow-datafusion/blob/9cf32cf2cda8472b87130142c4eee1126d4d9cbe/datafusion/src/physical_plan/distinct_expressions.rs#L45
 -- it would need some finagling to turn it into an arrow::compute kernel, I 
think, but it could be done
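
As a rough sketch of what a distinct count does, the following std-only Rust function counts distinct non-null values with a hash set. It is not the DataFusion code linked above and not an existing arrow::compute kernel; the name `distinct_count`, the `Option<i64>` column model, and the choice to ignore nulls are all assumptions made for this example.

```rust
use std::collections::HashSet;

/// Count the distinct non-null values in a column.
/// Nulls (`None`) are skipped rather than counted as a value (an assumption).
pub fn distinct_count(values: &[Option<i64>]) -> u64 {
    let distinct: HashSet<i64> = values
        .iter()
        .flatten() // drop nulls, yielding &i64
        .copied()
        .collect();
    distinct.len() as u64
}
```

A real kernel would need to handle every arrow data type and avoid hashing each value individually where possible, which is presumably where the finagling comes in.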
   
   cc @crepererum 


