alamb commented on pull request #512: URL: https://github.com/apache/arrow-rs/pull/512#issuecomment-871717610
> May you please check if this would be useful. I've left the distinct count as None as we'd need an arrow::compute kernel that does a distinct count.

Thanks for this PR @nevi-me!

In IOx we often already have the `min`, `max`, `null_count` (and sometimes `distinct_count`) values for data we are saving to parquet, so being able to supply them to the writer would be great.

If using the arrow compute kernels to compute the statistics is faster than doing it row by row, that seems like a win too from my perspective.

> @Dandandan @jorgecarleitao I'd expect such to already exist in datafusion, so would simply porting it to arrow::compute work?

DataFusion computes distinct counts using the code in https://github.com/apache/arrow-datafusion/blob/9cf32cf2cda8472b87130142c4eee1126d4d9cbe/datafusion/src/physical_plan/distinct_expressions.rs#L45. It would need some finagling to turn into an arrow::compute kernel, I think, but it could be done.

cc @crepererum
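
For illustration only, here is a minimal sketch of the four statistics discussed above (`min`, `max`, `null_count`, `distinct_count`), computed row by row over a nullable integer column. It uses only the Rust standard library; `ColumnStats` and `compute_stats` are hypothetical names invented for this sketch and are not part of the arrow or parquet crates, nor the DataFusion code linked above.

```rust
use std::collections::HashSet;

/// Hypothetical container for per-column statistics.
/// This is NOT the parquet crate's statistics type; it exists only for this sketch.
#[derive(Debug, PartialEq)]
pub struct ColumnStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: u64,
    pub distinct_count: u64,
}

/// Computes the statistics in a single row-by-row pass.
/// Nulls are counted separately and do not contribute to min, max,
/// or the distinct count (an assumption made for this sketch).
pub fn compute_stats(values: &[Option<i64>]) -> ColumnStats {
    let mut seen: HashSet<i64> = HashSet::new();
    let mut null_count: u64 = 0;
    let mut min: Option<i64> = None;
    let mut max: Option<i64> = None;

    for value in values {
        match value {
            None => null_count += 1,
            Some(x) => {
                // Track each non-null value once for the distinct count.
                seen.insert(*x);
                min = Some(min.map_or(*x, |m| m.min(*x)));
                max = Some(max.map_or(*x, |m| m.max(*x)));
            }
        }
    }

    ColumnStats {
        min,
        max,
        null_count,
        distinct_count: seen.len() as u64,
    }
}
```

A caller that already has these values, as described for IOx, would skip this pass and hand the precomputed numbers to the writer instead. The hash set makes the distinct count the most expensive part in both time and memory, which is one reason it might reasonably stay optional.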
