GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/18847
[SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException

## What changes were proposed in this pull request?

Backports the following JIRAs into 2.2.

```
[SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
[SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats
[SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
```

## How was this patch tested?

Test cases were added as part of the original fixes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark datasource_stat_2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18847.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18847

----
commit 707529428a23ffd65c8212a273d12a4df58b39e6
Author: Xiao Li <gatorsm...@gmail.com>
Date: 2017-05-23T00:28:30Z

[SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl

### What changes were proposed in this pull request?

After adding a new field `stats` to `CatalogTable`, we should not expose Hive-specific stats metadata to `MetastoreRelation`. Doing so complicates all the related code. It also introduces a bug in `SHOW CREATE TABLE`: the statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since they could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

This PR also fixes the issue of filling Hive-generated RowCounts into our stats.

This PR handles Hive-specific stats metadata in `HiveClientImpl`.

### How was this patch tested?

Added a few test cases.

Author: Xiao Li <gatorsm...@gmail.com>

Closes #14971 from gatorsmile/showCreateTableNew.
commit a933350805eda961e41a429317cd3397d159a6fb
Author: Zhenhua Wang <wzh_...@163.com>
Date: 2017-06-12T00:23:04Z

[SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let `alterTable` keep existing stats

## What changes were proposed in this pull request?

Currently, Hive's stats are read into `CatalogStatistics`, while Spark's stats are also persisted through `CatalogStatistics`. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.

For example, for a catalog table, we read stats from Hive, e.g. "totalSize", and put them into `CatalogStatistics`. Then, when the "ALTER TABLE" command is used, we store the stats in `CatalogStatistics` into the metastore as Spark's stats (because we don't know whether they came from Spark or not). But Spark's stats should only be generated by the "ANALYZE" command, so this is unexpected behavior for "ALTER TABLE".

Secondly, now that we have Spark's stats in the metastore, after inserting new data, although Hive updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect Spark's stats (which should not exist) over Hive's stats.

A running example is shown in [JIRA](https://issues.apache.org/jira/browse/SPARK-21031).

To fix this, we add a new method `alterTableStats` to store Spark's stats, and let `alterTable` keep existing stats.

## How was this patch tested?

Added new tests.

Author: Zhenhua Wang <wzh_...@163.com>

Closes #18248 from wzhfy/separateHiveStats.

commit a03e188818b1505383f3487904d62d90519e72c9
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2017-08-03T16:25:48Z

[SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException

In the case of datasource tables (when they are stored in a non-Hive-compatible way), the schema information is recorded as table properties in the Hive metastore.
The alterTableStats method needs to get the schema information from the table properties for data source tables before recording the column-level statistics. Currently, we don't get the correct schema information and fail with java.util.NoSuchElementException.

A new test case is added in StatisticsSuite.

Author: Dilip Biswal <dbis...@us.ibm.com>

Closes #18804 from dilipbiswal/datasource_stats.

----
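
For readers who want a concrete picture of the SPARK-21599 failure described in the last commit, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not Spark's code. The property keys (`spark.sql.sources.schema.numParts`, `spark.sql.sources.schema.part.N`, `spark.sql.statistics.colStats.*`) are believed to match the ones Spark uses, but treat them as assumptions. The helper names `schema_from_properties` and `column_stat_properties`, and the placeholder schema `["col"]`, are hypothetical and exist only for illustration.

```python
import json

# Assumed property keys; believed to match Spark's, but not confirmed by this email.
NUM_PARTS_KEY = "spark.sql.sources.schema.numParts"
PART_PREFIX = "spark.sql.sources.schema.part."
COL_STATS_PREFIX = "spark.sql.statistics.colStats."


def schema_from_properties(props):
    """Reassemble the JSON schema split across table properties; return column names."""
    num_parts = int(props[NUM_PARTS_KEY])
    schema_json = "".join(props[f"{PART_PREFIX}{i}"] for i in range(num_parts))
    return [field["name"] for field in json.loads(schema_json)["fields"]]


def column_stat_properties(column_names, col_stats):
    """Flatten per-column stats into properties, looking each column up in the schema.

    Raises KeyError for an unknown column, which stands in here for the
    java.util.NoSuchElementException mentioned in the commit message.
    """
    known = set(column_names)
    props = {}
    for col, stats in col_stats.items():
        if col not in known:
            raise KeyError(col)
        for stat_name, value in stats.items():
            props[f"{COL_STATS_PREFIX}{col}.{stat_name}"] = str(value)
    return props


# A data source table whose real schema lives only in table properties.
schema_json = json.dumps({"type": "struct", "fields": [
    {"name": "id", "type": "integer", "nullable": True, "metadata": {}},
    {"name": "name", "type": "string", "nullable": True, "metadata": {}},
]})
half = len(schema_json) // 2
table_props = {
    NUM_PARTS_KEY: "2",
    PART_PREFIX + "0": schema_json[:half],
    PART_PREFIX + "1": schema_json[half:],
}
stats = {"id": {"distinctCount": 3, "nullCount": 0}}

# Hypothetical placeholder schema seen when the properties are not consulted.
placeholder_schema = ["col"]
try:
    column_stat_properties(placeholder_schema, stats)
except KeyError as missing:
    print("column lookup failed for", missing)

# Restoring the schema from the table properties first makes the lookup succeed.
restored = schema_from_properties(table_props)
print(column_stat_properties(restored, stats))
```

The point of the sketch is only the ordering: the schema is restored from the table properties before the column statistics are serialized, which is what the commit message says `alterTableStats` needs to do.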