GitHub user dilipbiswal opened a pull request:
https://github.com/apache/spark/pull/18847
[SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2] Collecting column
statistics for datasource tables may fail with java.util.NoSuchElementException
## What changes were proposed in this pull request?
Backports the following JIRAs into 2.2.
```
[SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
[SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let
`alterTable` keep existing stats
[SPARK-21599][SQL] Collecting column statistics for datasource tables may
fail with java.util.NoSuchElementException
```
## How was this patch tested?
Test cases were added as part of the original fixes.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dilipbiswal/spark datasource_stat_2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18847.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18847
----
commit 707529428a23ffd65c8212a273d12a4df58b39e6
Author: Xiao Li <[email protected]>
Date: 2017-05-23T00:28:30Z
[SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
### What changes were proposed in this pull request?
After adding a new field `stats` to `CatalogTable`, we should not
expose Hive-specific stats metadata to `MetastoreRelation`. Doing so complicates
all the related code. It also introduces a bug in `SHOW CREATE TABLE`. The
statistics-related table properties should be skipped by `SHOW CREATE TABLE`,
since they could be incorrect in the newly created table. See the Hive JIRA:
https://issues.apache.org/jira/browse/HIVE-13792
Also fix the issue of filling Hive-generated RowCounts into our stats.
This PR handles Hive-specific stats metadata in `HiveClientImpl`.
### How was this patch tested?
Added a few test cases.
Author: Xiao Li <[email protected]>
Closes #14971 from gatorsmile/showCreateTableNew.
commit a933350805eda961e41a429317cd3397d159a6fb
Author: Zhenhua Wang <[email protected]>
Date: 2017-06-12T00:23:04Z
[SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let
`alterTable` keep existing stats
## What changes were proposed in this pull request?
Currently, hive's stats are read into `CatalogStatistics`, while spark's
stats are also persisted through `CatalogStatistics`. As a result, hive's stats
can be unexpectedly propagated into spark's stats.
For example, for a catalog table, we read stats from hive, e.g. "totalSize",
and put them into `CatalogStatistics`. Then, when the "ALTER TABLE" command
runs, we store the stats in `CatalogStatistics` into the metastore as spark's
stats (because we don't know whether they came from spark or not). But spark's
stats should only be generated by the "ANALYZE" command, so writing them is
unexpected behavior for "ALTER TABLE".
Secondly, once spark's stats are in the metastore, after inserting new
data, although hive updates "totalSize" in the metastore, we still cannot get
the right `sizeInBytes` in `CatalogStatistics`, because we respect spark's
stats (which should not exist) over hive's stats.
A running example is shown in
[JIRA](https://issues.apache.org/jira/browse/SPARK-21031).
To fix this, we add a new method `alterTableStats` to store spark's stats,
and let `alterTable` keep existing stats.
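The split between the two methods can be illustrated with a toy model. This is only a sketch in Python, not Spark's implementation: a plain dict stands in for the metastore table properties, the function names `alter_table` and `alter_table_stats` are hypothetical stand-ins for `alterTable` and `alterTableStats`, and the `spark.sql.statistics.` key prefix is assumed here for illustration.

```python
# Toy model of the fix. A dict stands in for metastore table properties.
# The prefix and both function names are assumptions made for this sketch.
SPARK_STATS_PREFIX = "spark.sql.statistics."


def alter_table(stored_props, new_props):
    """Update table metadata, carrying over stored spark stats unchanged."""
    kept = {k: v for k, v in stored_props.items()
            if k.startswith(SPARK_STATS_PREFIX)}
    # Ignore any spark stats in the incoming definition: they may have been
    # derived from hive's stats when the table was read.
    cleaned = {k: v for k, v in new_props.items()
               if not k.startswith(SPARK_STATS_PREFIX)}
    return {**cleaned, **kept}


def alter_table_stats(stored_props, size_in_bytes):
    """The only path that writes spark stats, as ANALYZE would use."""
    props = {k: v for k, v in stored_props.items()
             if not k.startswith(SPARK_STATS_PREFIX)}
    if size_in_bytes is not None:
        props[SPARK_STATS_PREFIX + "totalSize"] = str(size_in_bytes)
    return props


# Hive reports totalSize; ALTER TABLE must not turn it into spark stats.
props = alter_table({"totalSize": "100"},
                    {"totalSize": "100", "comment": "x"})
# ANALYZE explicitly records spark stats.
analyzed = alter_table_stats(props, 100)
# Hive later updates totalSize; ALTER TABLE keeps the spark stats as stored.
updated = alter_table(analyzed, {"totalSize": "250",
                                 SPARK_STATS_PREFIX + "totalSize": "250"})
```

In this model, `alter_table` never creates or overwrites a spark stats entry, so hive's "totalSize" cannot leak into spark's stats through an "ALTER TABLE" command.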
## How was this patch tested?
Added new tests.
Author: Zhenhua Wang <[email protected]>
Closes #18248 from wzhfy/separateHiveStats.
commit a03e188818b1505383f3487904d62d90519e72c9
Author: Dilip Biswal <[email protected]>
Date: 2017-08-03T16:25:48Z
[SPARK-21599][SQL] Collecting column statistics for datasource tables may
fail with java.util.NoSuchElementException
For datasource tables that are stored in a non-hive-compatible way, the
schema information is recorded as table properties in the hive
meta-store. The alterTableStats method needs to get the schema information from
the table properties for such tables before recording the column level
statistics. Currently, we don't get the correct schema information and fail
with java.util.NoSuchElementException.
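The failure mode can be sketched with a toy model. This is an illustration in Python rather than Spark's code: the table layout, the single `spark.sql.sources.schema` property key, and the helper names are all hypothetical simplifications, and a Python `KeyError` plays the role of `java.util.NoSuchElementException`.

```python
import json

# Hypothetical, simplified property key holding the real schema as JSON.
SCHEMA_PROP = "spark.sql.sources.schema"


def resolve_schema(table):
    """Prefer the schema stored in table properties, if present."""
    props = table["properties"]
    if SCHEMA_PROP in props:
        return json.loads(props[SCHEMA_PROP])
    return table["metastore_schema"]


def column_stats_props(schema, col_stats):
    """Turn per-column stats into properties, looking up each column's type."""
    types = {field["name"]: field["type"] for field in schema}
    out = {}
    for col, stats in col_stats.items():
        col_type = types[col]  # KeyError if the schema lacks the column
        out["colStats." + col + ".type"] = col_type
        for name, value in stats.items():
            out["colStats." + col + "." + name] = str(value)
    return out


# A datasource table whose metastore schema is only a placeholder.
table = {
    "metastore_schema": [{"name": "col", "type": "array<string>"}],
    "properties": {
        SCHEMA_PROP: json.dumps([{"name": "id", "type": "bigint"}]),
    },
}
stats = {"id": {"distinctCount": 3}}

# With the schema taken from table properties, the lookup succeeds.
fixed = column_stats_props(resolve_schema(table), stats)

# With the placeholder metastore schema, the column is missing.
try:
    column_stats_props(table["metastore_schema"], stats)
    failed = False
except KeyError:
    failed = True
```

The point of the sketch is only the ordering: the schema has to be resolved from the table properties before any column-level statistics are serialized.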
A new test case is added in StatisticsSuite.
Author: Dilip Biswal <[email protected]>
Closes #18804 from dilipbiswal/datasource_stats.
----