zinking commented on PR #37083: URL: https://github.com/apache/spark/pull/37083#issuecomment-1193498331
> > BTW, with CBO off, where do we use row count?
>
> We use it in places like:
>
> https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
>
> where we just multiply the row count by the row size. We also use it for BF to create [bloomFilterAgg](https://github.com/apache/spark/blob/13882bd7b80cd89fc4c58bd96a5ef783a0744019/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala#L78-L84). In the v1 scenario, the [logical relation row count](https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L43-L45) can seep in from catalog stats, but as you correctly pointed out, the `row-count` has a chance of getting lost in places where we assume we only have sizeInBytes, for example here:
>
> https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58

I thought these stats were available in AQE, and more accurate there, though.
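To make the two cases in the quoted reply concrete, here is a rough sketch of the idea only, not Spark's code. It is written in Python for brevity, and every name in it (`Stats`, `estimate_from_row_count`, `size_only_passthrough`) is hypothetical. It assumes a size estimate is simply row count times an average row size, and that a "size only" step forwards `sizeInBytes` without a row count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stats:
    """Hypothetical stand-in for plan statistics (not Spark's Statistics class)."""
    size_in_bytes: int
    row_count: Optional[int] = None


def estimate_from_row_count(row_count: int, row_size_bytes: int) -> Stats:
    # Case 1: a row count is known, so size is row count times row size.
    return Stats(size_in_bytes=row_count * row_size_bytes, row_count=row_count)


def size_only_passthrough(child: Stats) -> Stats:
    # Case 2: a step that assumes only sizeInBytes exists.
    # It keeps the size but silently drops any row count the child carried.
    return Stats(size_in_bytes=child.size_in_bytes)


scan = estimate_from_row_count(row_count=1000, row_size_bytes=16)
after = size_only_passthrough(scan)
print(scan)
print(after)
```

Anything downstream of the second step that wants a row count (for example, sizing a Bloom filter) would have to fall back to something derived from the size alone.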
