[GitHub] [spark] zinking commented on pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables


zinking commented on PR #37083:
URL: https://github.com/apache/spark/pull/37083#issuecomment-1193498331


   > > BTW, with CBO off, where do we use row count?
   > 
   > We use it in places like:
   > 
   > https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
   > 
   > where we just multiply the row count by the row size. We also use it for BF to create [bloomFilterAgg](https://github.com/apache/spark/blob/13882bd7b80cd89fc4c58bd96a5ef783a0744019/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala#L78-L84).
   > In the v1 scenario, the [logical relation row-count](https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L43-L45) can seep in from catalog stats, but as you correctly pointed out, there is a chance of `row-count` getting lost in places where we assume we only have sizeInBytes, for example here:
   > 
   > https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58
   
   I thought these stats are available in AQE, and are more accurate, though.
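
   For readers following along, the "row count multiplied by row size" estimate mentioned in the quoted reply can be sketched roughly as below. This is a simplified model in plain Python, not Spark's actual code: the names `DEFAULT_SIZE`, `ROW_OVERHEAD`, `size_per_row`, and `estimate_size_in_bytes`, as well as the per-type widths, are assumptions made for illustration only.

```python
from typing import Optional, Sequence

# Hypothetical per-type default widths in bytes (illustrative values only).
DEFAULT_SIZE = {"int": 4, "long": 8, "double": 8, "string": 20}

# Assumed fixed per-row overhead in bytes.
ROW_OVERHEAD = 8


def size_per_row(column_types: Sequence[str]) -> int:
    """Estimated width of one row: fixed overhead plus each column's default width."""
    return ROW_OVERHEAD + sum(DEFAULT_SIZE[t] for t in column_types)


def estimate_size_in_bytes(
    row_count: Optional[int],
    column_types: Sequence[str],
    fallback_size: int,
) -> int:
    """Estimate output size as row_count * size_per_row.

    When no row count is available, the caller-supplied fallback is returned
    unchanged, which models the "we only have sizeInBytes" situation.
    """
    if row_count is None:
        return fallback_size
    # Clamp to at least 1 byte so an empty relation never reports size 0.
    return max(1, row_count * size_per_row(column_types))


if __name__ == "__main__":
    schema = ["int", "string"]
    print(estimate_size_in_bytes(1000, schema, fallback_size=10**9))
    print(estimate_size_in_bytes(None, schema, fallback_size=10**9))
```

   Under these assumptions, a known row count yields a much tighter estimate than the fallback, which is why losing `row-count` along the plan matters.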

