[GitHub] [spark] singhpk234 commented on pull request #37083: [SPARK-39678][SQL] Improve stats estimation for v2 tables


singhpk234 commented on PR #37083:
URL: https://github.com/apache/spark/pull/37083#issuecomment-1174817038

> Could you enable spark.sql.cbo.enabled to estimate row count?

Thanks @wangyum, I am aware of the alternate visitor we use with CBO.

I raised this PR considering the following:
1. CBO is turned off by default.
2. We already have rowCount propagated via LeafNodes (DSv2Relation) which
are used for estimating output size in SizeInBytesOnlyStatsPlanVisitor

https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
3. ANALYZE is not supported for v2 tables, so apart from row count, IMHO we
can't have NDV etc. I am referring to this JIRA:
https://issues.apache.org/jira/browse/SPARK-39420
4. As per my understanding, v1 tables can only pass in sizeInBytes unless
they have some stats in the catalog, whereas v2 tables already give both
(sizeInBytes and rowCount) from the relation itself. Hence I thought the row
count was going unaccounted for in v2 tables.

https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L43-L45
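
The difference described in points 2 and 4 can be sketched roughly as below. This is a simplified, self-contained model and not Spark's actual code: `Stats`, `StatsSketch`, `sizeOnlyUnary`, and `rowCountAwareUnary` are hypothetical names, and the row sizes are assumed to be supplied by the caller.

```scala
// Simplified stand-in for a plan node's statistics (hypothetical, not Spark's Statistics class).
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object StatsSketch {
  // Size-only style: scale the child's size by the ratio of output row width to
  // child row width, and do not propagate the row count.
  def sizeOnlyUnary(child: Stats, childRowSize: Long, outputRowSize: Long): Stats = {
    val scaled = child.sizeInBytes * outputRowSize / childRowSize
    // Keep the size at least 1 so that products over children never collapse to zero.
    Stats(sizeInBytes = scaled.max(BigInt(1)))
  }

  // Row-count-aware style (an assumption about what "accounting for" row count could mean):
  // if the child reports a row count, derive the size from it and keep propagating it;
  // otherwise fall back to the size-only estimate.
  def rowCountAwareUnary(child: Stats, childRowSize: Long, outputRowSize: Long): Stats =
    child.rowCount match {
      case Some(rows) => Stats((rows * outputRowSize).max(BigInt(1)), Some(rows))
      case None       => sizeOnlyUnary(child, childRowSize, outputRowSize)
    }
}
```

With a v1-like child such as `Stats(BigInt(1000))` both functions give the same size-only result, while a v2-like child such as `Stats(BigInt(1000), Some(BigInt(10)))` keeps its row count only in the second variant.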

Are you suggesting that this is expected behavior, i.e. by design?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
