[ 
https://issues.apache.org/jira/browse/SPARK-18505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18505:
--------------------------------
    Description: 
I'm spending more time at the design & code level of the cost-based optimizer 
now, and have found a number of issues related to maintainability and 
compatibility that I would like to address.

This is a small pull request to clean up AnalyzeColumnCommand:

1. Removed warning on duplicated columns. Warnings in log messages are useless 
since most users that run SQL don't see them.
2. Removed the nested updateStats function, by just inlining the function.
3. Renamed a few functions to better reflect what they do.
4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern 
to use an apply method that returns an instance of a class that is not of 
the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
6. Added more documentation explaining some of the non-obvious return types and 
code blocks.
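
To illustrate item 4, here is a minimal sketch of the pattern being removed. 
The names Stats, NamedStruct and StatsExpressions are hypothetical stand-ins 
and not Spark's actual classes; the point is only that Stats(...) reads like 
construction of a Stats while actually returning something else.

```scala
// Hypothetical stand-in for the returned expression type (not Spark's CreateNamedStruct).
final case class NamedStruct(fields: Seq[String])

// The pattern being removed: `apply` on object Stats returns a NamedStruct,
// so the call `Stats(...)` looks like it builds a Stats but does not.
final class Stats
object Stats {
  def apply(columns: Seq[String]): NamedStruct =
    NamedStruct(columns.map(c => s"ndv_$c"))
}

// Alternative: a descriptively named method whose return type is unsurprising.
object StatsExpressions {
  def createStatsStruct(columns: Seq[String]): NamedStruct =
    NamedStruct(columns.map(c => s"ndv_$c"))
}

object ApplyPatternDemo {
  def main(args: Array[String]): Unit = {
    val misleading: NamedStruct = Stats(Seq("a", "b"))
    val clear: NamedStruct = StatsExpressions.createStatsStruct(Seq("a", "b"))
    // Both calls build the same value; only the call site's readability differs.
    assert(misleading == clear)
  }
}
```

Both calls produce the same value; the second simply does not suggest that a 
Stats instance is being created.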

In follow-up pull requests, I'd like to address the following:

1. Get rid of the Map[String, ColumnStat], since internally we should be 
using Attribute to reference columns, rather than strings.
2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's 
execution path. Currently the two are coupled because ColumnStat takes in an 
InternalRow.
3. Correctness: Remove code path that stores statistics in the catalog using 
the base64 encoding of the UnsafeRow format, which is not stable across Spark 
versions.
4. Clearly document the data representation stored in the catalog for 
statistics.
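
As a rough sketch of what items 3 and 4 could look like, the following stores 
statistics as explicitly named string key/value pairs instead of an opaque 
binary encoding. Everything here is an assumption made for illustration: 
SimpleColumnStat, its fields, the key names and the version tag are 
hypothetical and do not describe the representation Spark actually uses.

```scala
import scala.util.Try

// Hypothetical, simplified column statistics (not Spark's ColumnStat).
final case class SimpleColumnStat(distinctCount: Long, nullCount: Long, max: Option[String])

object SimpleColumnStat {
  // Assumed format version, written with every entry so readers can reject
  // representations they do not understand.
  private val CurrentVersion = "1"

  // Serialize to documented, human-readable keys rather than a binary row layout.
  def toMap(s: SimpleColumnStat): Map[String, String] = {
    val required = Map(
      "version" -> CurrentVersion,
      "distinctCount" -> s.distinctCount.toString,
      "nullCount" -> s.nullCount.toString
    )
    required ++ s.max.map(v => "max" -> v).toList
  }

  // Returns None for an unknown version or for missing or malformed fields.
  def fromMap(m: Map[String, String]): Option[SimpleColumnStat] =
    for {
      v <- m.get("version") if v == CurrentVersion
      d <- m.get("distinctCount").flatMap(x => Try(x.toLong).toOption)
      n <- m.get("nullCount").flatMap(x => Try(x.toLong).toOption)
    } yield SimpleColumnStat(d, n, m.get("max"))
}
```

Because each field is stored under a named key, the stored form does not 
depend on an in-memory row layout that may change between versions.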


> Simplify AnalyzeColumnCommand
> -----------------------------
>
>                 Key: SPARK-18505
>                 URL: https://issues.apache.org/jira/browse/SPARK-18505
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin


