adriangb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3382885667
I gave it a shot in https://github.com/apache/datafusion/pull/17980. I decided to create a new struct to replace `ColumnStatistics` and a new struct to replace `Statistics` and added them as fields to `PartitionedFile`, making it backwards compatible to at least experiment with the new thing. A couple interesting things I found: 1. Have we considered `Precision<Distribution>`? It seems to nicely encapsulate what I think we want. 2. I included and think we want several other public fields, e.g. I think `total_byte_size: Precision<usize>` is sensible and we should keep it. The main thing to replace was `min: Precision<ScalarValue>` and `max: Precision<ScalarValue>`. I kept `sum: Precision<ScalarValue>` but feel that maybe that could just be made to be a field of the `Distribution` variants? @berkaysynnada any thoughts? 3. I couldn't put the new structs alongside the old ones because `datafusion/common` does not depend on `datafusion/expr-common`. Seems like we might need to do a bit of juggling if we want to access `Distribution` from the current location. 4. I don't love the current reperesentation `Vec<ColumnStatistics>` for several reasons: (1) for wide tables / scans having empty `ColumnStatistics` takes up a lot of memory. We had a situation where we essentially initialized every file in the table (say 10k) and initialized empty stats for each column (say 100) so we ended up with a couple hunded MB of empty stats; (2) it depends on the projection so it's somewhat hard to tie statistics to the column by name; (3) when you apply a projection you have to depp clone all of the statistics. I tried to address (1) and (3) by changing the representation to `Vec<Option<Arc<ColumnDistributionStatistics>>` which makes it cheaper to initailize as empty and cheaper to project. But that doesn't address (2). Should we also include the table schema for each file? It's just one more `Arc'd` field. Then we could have helpers to get projected statistics for certain columns or certain indices, etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
