gene-bordegaray commented on issue #19973:
URL: https://github.com/apache/datafusion/issues/19973#issuecomment-4092787336
> What if instead we:
>
> 1. Fix the propagation bugs for `byte_size` (which benefits all downstream consumers, not just avg size)
> 2. Add a helper method like `fn avg_byte_size(&self, num_rows: Precision<usize>) -> Precision<usize>` so callers get the derived value conveniently
>
> This avoids growing `ColumnStatistics` while still making avg byte size easy to use. What do you think?

Had some time to read this and think about it a little more. I think this is
true and the right direction to go if statistics continue to be a priority in
the community, since `avg_byte_size` can simply be recomputed as
`byte_size / num_rows`. It does, however, rely on our stats propagation being good.
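
For concreteness, here is a rough sketch of what such a helper could compute. The `Precision` enum below is a simplified local stand-in for DataFusion's `Precision<usize>`, and the free function `avg_byte_size` is hypothetical (the proposal above makes it a method on `ColumnStatistics`), so treat this as an illustration rather than the actual API:

```rust
/// Simplified local stand-in for DataFusion's `Precision<usize>`,
/// defined here only so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical helper: derive the average bytes per row from a total
/// `byte_size` and a row count.
pub fn avg_byte_size(byte_size: Precision, num_rows: Precision) -> Precision {
    use Precision::*;
    match (byte_size, num_rows) {
        // With zero rows there is no meaningful average.
        (_, Exact(0) | Inexact(0)) => Absent,
        // Both inputs exact and evenly divisible: the average is exact too.
        (Exact(b), Exact(n)) if b % n == 0 => Exact(b / n),
        // Otherwise the result is an estimate (integer division truncates).
        (Exact(b) | Inexact(b), Exact(n) | Inexact(n)) => Inexact(b / n),
        // If either input is missing, nothing can be derived.
        _ => Absent,
    }
}
```

The last arm is the weak spot: as soon as `byte_size` is lost somewhere in the plan, the helper has nothing to work with.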
My original motivation for keeping it stored as a column statistic is this scenario:
```text
Helper-only:
  - Scan:
      - num_rows = 1000000
      - byte_size = 1000000000
  - Filter:
      - num_rows = 10
      - byte_size gets dropped or becomes stale
  - Result:
      - helper can no longer give a useful avg

Stored avg_byte_size:
  - Scan:
      - num_rows = 1000000
      - byte_size = 1000000000
      - avg_byte_size = 1000
  - Filter:
      - num_rows = 10
      - byte_size gets dropped or becomes stale
      - avg_byte_size = 1000
  - Result:
      - downstream can still estimate byte_size using: 10 * 1000 = 10000
```
As the scenario shows, `avg_byte_size` is much easier to propagate than
`byte_size`, since it is derived once at the scan and then naturally passes
through filters, joins, limits, etc.
The stability of the avg seems like a promising way to deal with losing
stats, which could justify keeping it.
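
The scenario above can be played out in a few lines. This is only a sketch: `ColStats`, `scan`, `lossy_filter`, `helper_avg`, and `estimated_byte_size` are hypothetical names invented for illustration, and plain `Option<usize>` stands in for DataFusion's `Precision<usize>`:

```rust
/// Hypothetical, simplified column statistics; `None` models a value that
/// was dropped or is unknown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ColStats {
    pub num_rows: Option<usize>,
    pub byte_size: Option<usize>,
    pub avg_byte_size: Option<usize>,
}

/// Statistics as a scan might report them; the average is derived once here.
pub fn scan(num_rows: usize, byte_size: usize) -> ColStats {
    ColStats {
        num_rows: Some(num_rows),
        byte_size: Some(byte_size),
        // `checked_div` yields `None` for an empty scan.
        avg_byte_size: byte_size.checked_div(num_rows),
    }
}

/// A filter that re-estimates the row count but loses the total byte size.
/// The stored average is carried over untouched.
pub fn lossy_filter(input: ColStats, rows_out: usize) -> ColStats {
    ColStats {
        num_rows: Some(rows_out),
        byte_size: None,
        ..input
    }
}

/// Helper-only approach: the average must be recomputed from `byte_size`.
pub fn helper_avg(s: &ColStats) -> Option<usize> {
    s.byte_size
        .zip(s.num_rows)
        .and_then(|(bytes, rows)| bytes.checked_div(rows))
}

/// Stored-average approach: fall back to `num_rows * avg_byte_size` when the
/// total byte size is gone.
pub fn estimated_byte_size(s: &ColStats) -> Option<usize> {
    s.byte_size.or_else(|| {
        s.num_rows
            .zip(s.avg_byte_size)
            .and_then(|(rows, avg)| rows.checked_mul(avg))
    })
}
```

With `scan(1_000_000, 1_000_000_000)` followed by `lossy_filter(stats, 10)`, `helper_avg` has nothing left to divide, while `estimated_byte_size` still recovers 10 * 1000 = 10000.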