gene-bordegaray commented on issue #19973:
URL: https://github.com/apache/datafusion/issues/19973#issuecomment-4092787336

   > What if instead we:
   > 
   > 1. Fix the propagation bugs for `byte_size` (which benefits all downstream consumers, not just avg size)
   > 2. Add a helper method like `fn avg_byte_size(&self, num_rows: Precision<usize>) -> Precision<usize>` so callers get the derived value conveniently
   > 
   > This avoids growing `ColumnStatistics` while still making avg byte size easy to use. What do you think?
   
   Had some time to read this and think about it a little more. I think this is true and the right direction to go if statistics continue to be a priority in the community, since `avg_byte_size` can just be recomputed as `byte_size / num_rows`. It does, however, rely on our stats propagation being good.
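   
   To make the helper-only option concrete, here is a minimal sketch of the recomputation. It is only an illustration: the `Precision` enum below is a simplified, non-generic stand-in for DataFusion's type, and the free function `avg_byte_size` is hypothetical rather than the method signature proposed above.

   ```rust
   /// Simplified stand-in for DataFusion's `Precision` (assumption: not the real type).
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum Precision {
       Exact(usize),
       Inexact(usize),
       Absent,
   }

   /// Hypothetical helper: derive average bytes per row from the two totals.
   fn avg_byte_size(byte_size: Precision, num_rows: Precision) -> Precision {
       use Precision::*;
       match (byte_size, num_rows) {
           // If either input is missing, no average can be derived.
           (Absent, _) | (_, Absent) => Absent,
           // Avoid dividing by zero when there are no rows.
           (_, Exact(0) | Inexact(0)) => Absent,
           // Both totals exact and evenly divisible: the average is exact too.
           (Exact(b), Exact(n)) if b % n == 0 => Exact(b / n),
           // Anything else is an estimate (integer division truncates).
           (Exact(b) | Inexact(b), Exact(n) | Inexact(n)) => Inexact(b / n),
       }
   }
   ```

   With the scan numbers from the scenario below, `avg_byte_size(Precision::Exact(1_000_000_000), Precision::Exact(1_000_000))` gives `Precision::Exact(1000)`. As soon as `byte_size` is `Absent`, the result is `Absent` as well, which is the dependency on propagation mentioned above.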
   
   My original motivation for keeping it stored as a column is this scenario:
   ```text
   Helper-only:
       - Scan:
           - num_rows = 1000000
           - byte_size = 1000000000
       - Filter:
           - num_rows = 10
           - byte_size gets dropped or becomes stale
       - Result:
           - helper can no longer give a useful avg
   
   Stored avg_byte_size:
       - Scan:
           - num_rows = 1000000
           - byte_size = 1000000000
           - avg_byte_size = 1000
       - Filter:
           - num_rows = 10
           - byte_size gets dropped or becomes stale
           - avg_byte_size = 1000
       - Result:
           - downstream can still estimate byte_size using: 10 * 1000 = 10000
   ```
   
   As this shows, `avg_byte_size` is much easier to propagate than `byte_size`: it is derived once at the scan and then passes naturally through filters, joins, limits, etc.
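   
   The stored-average scenario can be sketched roughly as follows. Everything here is hypothetical: `ColStats`, `scan`, `filter`, and `estimated_byte_size` are illustrative names, not DataFusion APIs, and the filter is assumed to drop `byte_size` exactly as in the scenario above.

   ```rust
   /// Hypothetical, simplified per-column stats (not DataFusion's `ColumnStatistics`).
   #[derive(Debug, Clone, Copy, PartialEq)]
   struct ColStats {
       num_rows: usize,
       byte_size: Option<usize>,
       avg_byte_size: Option<usize>,
   }

   /// Scan: both totals are known, so the average is derived once here.
   fn scan(num_rows: usize, byte_size: usize) -> ColStats {
       ColStats {
           num_rows,
           byte_size: Some(byte_size),
           avg_byte_size: if num_rows == 0 { None } else { Some(byte_size / num_rows) },
       }
   }

   /// Filter: the row count shrinks and the total is assumed to be dropped,
   /// but the per-row average passes through unchanged.
   fn filter(input: ColStats, surviving_rows: usize) -> ColStats {
       ColStats {
           num_rows: surviving_rows,
           byte_size: None,
           avg_byte_size: input.avg_byte_size,
       }
   }

   /// Downstream: prefer the stored total, otherwise rebuild it from the average.
   fn estimated_byte_size(stats: ColStats) -> Option<usize> {
       stats
           .byte_size
           .or_else(|| stats.avg_byte_size.map(|avg| avg * stats.num_rows))
   }
   ```

   Here `estimated_byte_size(filter(scan(1_000_000, 1_000_000_000), 10))` yields `Some(10_000)`, matching the `10 * 1000` estimate above. If `avg_byte_size` were not stored, the same call would have nothing to fall back on and would return `None`.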
   
   The stability of the avg seems like a promising way to deal with losing stats, which could justify keeping it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

