GitHub user rxin opened a pull request:
https://github.com/apache/spark/pull/15933
[SPARK-18505][SQL] Simplify AnalyzeColumnCommand
## What changes were proposed in this pull request?
I'm spending more time at the design & code level for the cost-based optimizer
now, and have found a number of issues related to maintainability and
compatibility that I would like to address.
This is a small pull request to clean up AnalyzeColumnCommand:
1. Removed warning on duplicated columns. Warnings in log messages are
useless since most users that run SQL don't see them.
2. Removed the nested updateStats function, by just inlining the function.
3. Renamed a few functions to better reflect what they do.
4. Removed the factory apply method for ColumnStatStruct. It is a bad
pattern to use an apply method that returns an instance of a class that is
not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
6. Added more documentation explaining some of the non-obvious return types
and code blocks.
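
To illustrate item 4, here is a minimal sketch of why an apply method that returns a different type is confusing. All names below (NamedStruct, StatsBefore, StatsAfter, statExpressions) are hypothetical stand-ins and are not the actual Spark classes or the code in this patch.

```scala
// Hypothetical stand-in for an expression type such as CreateNamedStruct.
final case class NamedStruct(fields: Seq[(String, String)])

// Confusing: `StatsBefore("age")` reads like constructing a StatsBefore,
// but the value it produces is actually a NamedStruct.
object StatsBefore {
  def apply(column: String): NamedStruct =
    NamedStruct(Seq("max" -> s"max($column)", "min" -> s"min($column)"))
}

// Clearer: a descriptively named method makes the returned type
// obvious at the call site.
object StatsAfter {
  def statExpressions(column: String): NamedStruct =
    NamedStruct(Seq("max" -> s"max($column)", "min" -> s"min($column)"))
}
```

Both variants build the same value; only the call site changes, from `StatsBefore("age")` to `StatsAfter.statExpressions("age")`.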
In follow-up pull requests, I'd like to address the following:
1. Get rid of the Map[String, ColumnStat] map, since internally we should
be using Attribute to reference columns, rather than strings.
2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's
execution path. Currently the two are coupled because ColumnStat takes in an
InternalRow.
3. Correctness: Remove code path that stores statistics in the catalog
using the base64 encoding of the UnsafeRow format, which is not stable across
Spark versions.
4. Clearly document the data representation stored in the catalog for
statistics.
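
As a rough sketch of the direction for follow-up items 3 and 4, statistics could be stored as documented, human-readable string properties rather than an encoded binary row. This is only an illustration under assumed names (SimpleColumnStat, toProperties, fromProperties and the property keys are all hypothetical), not the design of any follow-up patch.

```scala
import scala.util.Try

// Hypothetical simplified column statistic; fields are illustrative only.
final case class SimpleColumnStat(
    distinctCount: Long,
    nullCount: Long,
    min: Option[String],
    max: Option[String]) {

  // Serialize to plain string key/value pairs. The keys form the documented
  // representation and do not depend on any internal row layout.
  def toProperties: Map[String, String] =
    Map(
      "distinctCount" -> distinctCount.toString,
      "nullCount" -> nullCount.toString) ++
      min.map("min" -> _) ++
      max.map("max" -> _)
}

object SimpleColumnStat {
  // Returns None when a required key is missing or is not a valid number.
  def fromProperties(props: Map[String, String]): Option[SimpleColumnStat] =
    for {
      d <- props.get("distinctCount").flatMap(s => Try(s.toLong).toOption)
      n <- props.get("nullCount").flatMap(s => Try(s.toLong).toOption)
    } yield SimpleColumnStat(d, n, props.get("min"), props.get("max"))
}
```

A representation like this round-trips through `toProperties` and `fromProperties` without depending on the UnsafeRow format.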
## How was this patch tested?
Affected test cases have been updated.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rxin/spark SPARK-18505
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15933.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15933
----
commit 1a713fd71040fd5a66483141611675847ce5ff29
Author: Reynold Xin <[email protected]>
Date: 2016-11-18T21:52:31Z
[SPARK-18505][SQL] Simplify AnalyzeColumnCommand
----