GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/15959

    [SPARK-18522][SQL] Explicit contract for column stats serialization

    ## What changes were proposed in this pull request?
    The current implementation of column stats uses the base64 encoding of the 
internal UnsafeRow format to persist statistics (in table properties in the Hive 
metastore). This is an internal format that is not stable across different 
versions of Spark and should NOT be used for persistence. In addition, it would 
be better if statistics stored in the catalog were human-readable.
    
    This pull request introduces the following changes:
    
    1. Created a single ColumnStat class for all data types. All data types 
track the same set of statistics.
    2. Documented clearly what JVM data types are being used to store what data.
    3. Defined a simple Map[String, String] interface for serializing and 
deserializing column stats into/from the catalog.
    4. Rearranged the method/function structure so it is clearer what the 
supported data types are, and moved the stats-generation logic into the 
ColumnStat class so it is easy to find.
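    
    As a rough illustration of item 3 (not the actual code in this patch), a 
Map[String, String] round trip for a single stats class could look like the 
sketch below. The class name ColumnStatSketch, its fields, and the map keys are 
all assumptions made for the example; the real ColumnStat may track different 
statistics and use different keys.

```scala
// Hypothetical, simplified stand-in for the ColumnStat class described above.
// Field names and map keys are illustrative assumptions, not Spark's actual API.
final case class ColumnStatSketch(
    distinctCount: Long,
    min: Option[Long],
    max: Option[Long],
    nullCount: Long,
    avgLen: Long,
    maxLen: Long) {

  /** Serialize to plain, human-readable strings suitable for table properties. */
  def toMap: Map[String, String] = {
    val required = Map(
      "version" -> ColumnStatSketch.Version.toString,
      "distinctCount" -> distinctCount.toString,
      "nullCount" -> nullCount.toString,
      "avgLen" -> avgLen.toString,
      "maxLen" -> maxLen.toString)
    // Optional statistics are omitted from the map when absent.
    required ++
      min.map(v => "min" -> v.toString).toList ++
      max.map(v => "max" -> v.toString).toList
  }
}

object ColumnStatSketch {
  // Assumed version tag so a future format change can be detected by readers.
  val Version = 1

  /** Deserialize; returns None if a required key is missing or not a number. */
  def fromMap(m: Map[String, String]): Option[ColumnStatSketch] =
    try {
      Some(ColumnStatSketch(
        distinctCount = m("distinctCount").toLong,
        min = m.get("min").map(_.toLong),
        max = m.get("max").map(_.toLong),
        nullCount = m("nullCount").toLong,
        avgLen = m("avgLen").toLong,
        maxLen = m("maxLen").toLong))
    } catch {
      case _: NoSuchElementException | _: NumberFormatException => None
    }
}
```

    Because every value is a plain string, the stored properties stay readable 
in the metastore and do not depend on any internal row format.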
    
    ## How was this patch tested?
    TBD. I haven't updated any of the test cases yet, so they currently fail 
to compile.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-18522

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15959
    
----
commit 861834437330493597767c9b1ba0948dbb3b7960
Author: Reynold Xin
Date:   2016-11-21T07:44:24Z

    [SPARK-18522][SQL] Explicit contract for column stats serialization

commit a05ab721391ebea98b579e8b37034d4f8e911840
Author: Reynold Xin
Date:   2016-11-21T07:45:22Z

    Remove ...

----

