GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/18847

    [SPARK-12717][SPARK-21031][SPARK-21599][SQL][BRANCH-2.2] Collecting column 
statistics for datasource tables may fail with java.util.NoSuchElementException

    ## What changes were proposed in this pull request?
    Backports the following JIRAs into 2.2.
    ```
    [SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
    [SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let 
`alterTable` keep existing stats
    [SPARK-21599][SQL] Collecting column statistics for datasource tables may 
fail with java.util.NoSuchElementException
    ```
    ## How was this patch tested?
    Test cases were added as part of the original fixes.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark datasource_stat_2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18847
    
----
commit 707529428a23ffd65c8212a273d12a4df58b39e6
Author: Xiao Li <gatorsm...@gmail.com>
Date:   2017-05-23T00:28:30Z

    [SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl
    
    ### What changes were proposed in this pull request?
    
    After adding a new field `stats` to `CatalogTable`, we should not 
expose Hive-specific stats metadata to `MetastoreRelation`. Doing so complicates 
all the related code. It also introduces a bug in `SHOW CREATE TABLE`: the 
statistics-related table properties should be skipped by `SHOW CREATE TABLE`, 
since they could be incorrect in the newly created table. See the Hive JIRA: 
https://issues.apache.org/jira/browse/HIVE-13792
    
    This PR also fixes the filling of Hive-generated row counts into our stats.
    
    This PR is to handle Hive-specific Stats metadata in `HiveClientImpl`.
    ### How was this patch tested?
    
    Added a few test cases.
    
    Author: Xiao Li <gatorsm...@gmail.com>
    
    Closes #14971 from gatorsmile/showCreateTableNew.
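
As a rough illustration of the `SHOW CREATE TABLE` point in the commit above, the following Python sketch filters statistics-related properties out of a property map before it would be rendered into DDL. It is not Spark's code: the function name `properties_for_show_create` is hypothetical, and the key set is an assumed, illustrative sample of Hive-generated statistics keys rather than the list the patch actually uses.

```python
from typing import Dict

# Assumption: an illustrative sample of Hive-generated statistics keys.
# The set used by the real patch may differ.
HIVE_STATS_KEYS = frozenset(
    {"numFiles", "numRows", "totalSize", "rawDataSize", "COLUMN_STATS_ACCURATE"}
)


def properties_for_show_create(properties: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the table properties without statistics entries.

    Statistics describe the data of the existing table, so copying them
    into the DDL of a newly created (empty) table could be incorrect.
    """
    return {k: v for k, v in properties.items() if k not in HIVE_STATS_KEYS}
```

The input map is left unmodified, so the catalog entry itself keeps its statistics; only the generated statement omits them.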

commit a933350805eda961e41a429317cd3397d159a6fb
Author: Zhenhua Wang <wzh_...@163.com>
Date:   2017-06-12T00:23:04Z

    [SPARK-21031][SQL] Add `alterTableStats` to store spark's stats and let 
`alterTable` keep existing stats
    
    ## What changes were proposed in this pull request?
    
    Currently, hive's stats are read into `CatalogStatistics`, while spark's 
stats are also persisted through `CatalogStatistics`. As a result, hive's stats 
can be unexpectedly propagated into spark's stats.
    
    For example, for a catalog table, we read stats from hive (e.g. "totalSize") 
and put them into `CatalogStatistics`. Then an "ALTER TABLE" command stores 
the stats in `CatalogStatistics` into the metastore as spark's stats 
(because we don't know whether they came from spark or not). But spark's stats 
should only be generated by the "ANALYZE" command, so this is unexpected 
behavior for "ALTER TABLE".
    
    Secondly, once spark's stats are in the metastore, we still cannot get the 
right `sizeInBytes` in `CatalogStatistics` after new data is inserted, even 
though hive has updated "totalSize" in the metastore, because we respect 
spark's stats (which should not exist) over hive's stats.
    
    A running example is shown in 
[JIRA](https://issues.apache.org/jira/browse/SPARK-21031).
    
    To fix this, we add a new method `alterTableStats` to store spark's stats, 
and let `alterTable` keep existing stats.
    
    ## How was this patch tested?
    
    Added new tests.
    
    Author: Zhenhua Wang <wzh_...@163.com>
    
    Closes #18248 from wzhfy/separateHiveStats.
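
The separation described in the commit above can be sketched with a toy in-memory model. This is a Python illustration under stated assumptions, not Spark's implementation: `ToyCatalog`, `hive_insert`, `has_spark_stats`, and the snake_case method names are hypothetical stand-ins, and the real `CatalogStatistics` carries more fields than the single size shown here.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class CatalogStatistics:
    size_in_bytes: int


@dataclass(frozen=True)
class CatalogTable:
    name: str
    properties: Dict[str, str] = field(default_factory=dict)
    stats: Optional[CatalogStatistics] = None


class ToyCatalog:
    """Hypothetical in-memory stand-in for a metastore-backed catalog."""

    def __init__(self) -> None:
        self._properties: Dict[str, Dict[str, str]] = {}
        self._hive_total_size: Dict[str, int] = {}
        self._spark_stats: Dict[str, CatalogStatistics] = {}

    def create_table(self, name: str, hive_total_size: int) -> None:
        self._properties[name] = {}
        self._hive_total_size[name] = hive_total_size

    def hive_insert(self, name: str, added_bytes: int) -> None:
        # Models hive updating "totalSize" after new data is inserted.
        self._hive_total_size[name] += added_bytes

    def get_table(self, name: str) -> CatalogTable:
        # Spark's stats, when present, are respected over hive's.
        stats = self._spark_stats.get(name)
        if stats is None:
            stats = CatalogStatistics(self._hive_total_size[name])
        return CatalogTable(name, dict(self._properties[name]), stats)

    def alter_table(self, table: CatalogTable) -> None:
        # Updates metadata only; table.stats is ignored, so existing
        # spark stats (or their absence) are left untouched.
        self._properties[table.name] = dict(table.properties)

    def alter_table_stats(self, name: str, stats: CatalogStatistics) -> None:
        # The only path that stores spark's stats (e.g. from ANALYZE).
        self._spark_stats[name] = stats

    def has_spark_stats(self, name: str) -> bool:
        return name in self._spark_stats


catalog = ToyCatalog()
catalog.create_table("t", hive_total_size=100)
table = catalog.get_table("t")  # stats here come from hive's totalSize
catalog.alter_table(replace(table, properties={"comment": "demo"}))
catalog.hive_insert("t", 50)
size_after_insert = catalog.get_table("t").stats.size_in_bytes
```

Because `alter_table` never writes stats in this model, the hive-derived size read into `table` is not persisted as spark's stats, and the later read still reflects the size hive maintains.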

commit a03e188818b1505383f3487904d62d90519e72c9
Author: Dilip Biswal <dbis...@us.ibm.com>
Date:   2017-08-03T16:25:48Z

    [SPARK-21599][SQL] Collecting column statistics for datasource tables may 
fail with java.util.NoSuchElementException
    
    For datasource tables stored in a non-hive-compatible way, the schema 
information is recorded as table properties in the hive metastore. The 
alterTableStats method needs to get the schema information from the table 
properties for data source tables before recording the column-level 
statistics. Currently, we don't get the correct schema information and fail 
with java.util.NoSuchElementException.
    
    A new test case is added in StatisticsSuite.
    
    Author: Dilip Biswal <dbis...@us.ibm.com>
    
    Closes #18804 from dilipbiswal/datasource_stats.
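
The failure mode in the commit above can be sketched as follows. This is a minimal Python model, not Spark's code: the property key `toy.schema.json`, the `toy.colStats.` prefix, the placeholder schema, and both function names are hypothetical, and Python's `KeyError` stands in for `java.util.NoSuchElementException`.

```python
import json
from typing import Dict

# Hypothetical property key under which the real schema is stored.
SCHEMA_PROPERTY = "toy.schema.json"


def restored_schema(raw_table: dict) -> Dict[str, str]:
    """Prefer the schema recorded in table properties, if there is one."""
    props = raw_table["properties"]
    if SCHEMA_PROPERTY in props:
        return json.loads(props[SCHEMA_PROPERTY])
    return dict(raw_table["schema"])


def column_stats_properties(
    schema: Dict[str, str], col_stats: Dict[str, Dict[str, int]]
) -> Dict[str, str]:
    """Serialize column-level stats; each column must exist in the schema."""
    props: Dict[str, str] = {}
    for column, stats in col_stats.items():
        data_type = schema[column]  # raises KeyError for an unknown column
        props[f"toy.colStats.{column}.type"] = data_type
        for stat_name, value in stats.items():
            props[f"toy.colStats.{column}.{stat_name}"] = str(value)
    return props


raw_table = {
    # Placeholder schema as seen in the metastore entry (illustrative).
    "schema": {"col": "array<string>"},
    "properties": {
        SCHEMA_PROPERTY: json.dumps({"id": "int", "name": "string"})
    },
}
col_stats = {"id": {"distinctCount": 3, "nullCount": 0}}

# Looking up columns in raw_table["schema"] would fail for "id";
# restoring the schema from the properties first avoids that.
props = column_stats_properties(restored_schema(raw_table), col_stats)
```

The design point is only the ordering: the schema is restored from the properties before any per-column lookup takes place.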

----

