[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

mbasmanova Mon, 31 Jul 2017 10:44:36 -0700

Github user mbasmanova commented on the issue:

    https://github.com/apache/spark/pull/18421
  
    [ @gatorsmile ] Is it possible that the partition-level row count is larger 
than the table-level row count after running this new command?
    
    I think so. If the table contents have changed since table-level stats were 
collected, and a very large new partition was added or an existing one was 
replaced, collecting partition-level stats for that one partition will result 
in a partition row count that exceeds the table row count.
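
    As a rough illustration of that scenario, here is a toy sketch in plain 
Python. It does not use Spark; `TableStats`, `analyze_table`, and 
`analyze_partition` are hypothetical stand-ins for the catalog and the two 
ANALYZE TABLE variants, and the partition names and row counts are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableStats:
    """Toy stand-in for catalog statistics (not Spark's actual classes)."""
    table_row_count: Optional[int] = None
    partition_row_counts: Dict[str, int] = field(default_factory=dict)


def analyze_table(stats: TableStats, data: Dict[str, List[int]]) -> None:
    # Models table-level stats collection: count rows across all partitions.
    stats.table_row_count = sum(len(rows) for rows in data.values())


def analyze_partition(stats: TableStats, data: Dict[str, List[int]], partition: str) -> None:
    # Models partition-level stats collection: only this partition is counted,
    # and the table-level count is left untouched.
    stats.partition_row_counts[partition] = len(data[partition])


# Two small partitions exist when table-level stats are collected.
data = {"ds=2017-07-01": [0] * 100, "ds=2017-07-02": [0] * 150}
stats = TableStats()
analyze_table(stats, data)

# A very large partition is added afterwards; only that partition is analyzed.
data["ds=2017-07-03"] = [0] * 1000
analyze_partition(stats, data, "ds=2017-07-03")

# The stale table-level count is now smaller than one partition's count.
stale = stats.partition_row_counts["ds=2017-07-03"] > stats.table_row_count
```

    In this model the table-level count only catches up once `analyze_table` 
is run again.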
    
    I think you are pointing to a larger issue: statistics become stale when 
a table is modified. My thinking is that the ANALYZE TABLE commands provide the 
necessary primitives for collecting statistics, but a separate facility is 
needed to manage the process of collecting them. At a very basic level, a user 
would have to run the necessary ANALYZE TABLE commands manually whenever table 
contents change; for larger deployments, an automated solution could be 
devised.
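
    A minimal sketch of what such automation might generate, assuming the 
caller already knows which partitions changed. The helper name 
`analyze_statements` and the table and partition names are hypothetical, and 
partition values are assumed to be strings; the statement text follows the 
ANALYZE TABLE ... PARTITION (...) COMPUTE STATISTICS form discussed in this 
pull request.

```python
from typing import Dict, List


def analyze_statements(table: str, changed_partitions: List[Dict[str, str]]) -> List[str]:
    """Build ANALYZE TABLE statements for modified partitions (hypothetical helper).

    Each partition spec maps partition column names to string values.
    A table-level statement is appended last so that the table row count
    is refreshed together with the partition row counts.
    """
    statements = []
    for spec in changed_partitions:
        # Render the spec as: col1 = 'v1', col2 = 'v2'
        clause = ", ".join(f"{col} = '{val}'" for col, val in spec.items())
        statements.append(
            f"ANALYZE TABLE {table} PARTITION ({clause}) COMPUTE STATISTICS"
        )
    statements.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS")
    return statements


# Hypothetical table with one changed partition.
generated = analyze_statements("sales", [{"ds": "2017-07-03"}])
```

    How changed partitions are detected, and when the statements are run, is 
left open here; that is the part a separate management facility would own.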



