Github user mbasmanova commented on the issue:
https://github.com/apache/spark/pull/18421
[ @gatorsmile ] Is that possible the partition-level row counts is larger
than the table-level row counts after running this new command?
I think so. If table contents have changed since table-level stats were
collected and a new very large partition was added or existing one replaced,
collecting partition-level stats for that one partition will result in a
partition row-count exceeding table row-count.
I think you are pointing out to a larger issue of staleness of statistics
which results from table modifications. My thinking is that ANALYZE TABLE
commands provide the necessary primitives to collect statistics, but a separate
facility is needed to manage the process of collecting these stats. At a very
basic level a user would be required to manually run necessary ANALYZE TABLE
commands when table content is changed, but for larger deployments an automated
solution could be devised.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]