[jira] [Created] (SPARK-23445) ColumnStat refactoring

Juliusz Sompolski (JIRA) Thu, 15 Feb 2018 18:04:36 -0800

Juliusz Sompolski created SPARK-23445:
-----------------------------------------


             Summary: ColumnStat refactoring
                 Key: SPARK-23445
                 URL: https://issues.apache.org/jira/browse/SPARK-23445
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Juliusz Sompolski


Refactor ColumnStat to be more flexible.
 * Split {{ColumnStat}} and {{CatalogColumnStat}} just like 
{{CatalogStatistics}} is split from {{Statistics}}. This detaches how the 
statistics are stored from how they are processed in the query plan. 
{{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, so it does not 
depend on dataType information.
 * For {{CatalogColumnStat}}, parse column names from property names in the 
metastore ({{KEY_VERSION}} property), not from metastore schema. This allows 
the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself 
is not in the metastore.
 * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns 
were optional already. Having them all optional is more consistent, and gives 
flexibility to e.g. drop some of the fields through transformations if they are 
difficult / impossible to calculate.
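
The three points above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Spark's actual code: the simplified {{DataType}} hierarchy, the {{CatalogColumnStatProps}} helper, the {{toPlanStat}} method, and the "<column>.<field>" property layout are all assumptions made for the example.

```scala
// Hypothetical, simplified stand-in for Spark's DataType hierarchy.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType

// Plan-side statistics: min/max are typed values, and every field is optional.
final case class ColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[Any] = None,
    max: Option[Any] = None,
    nullCount: Option[BigInt] = None)

// Catalog-side statistics: min/max stay as strings, so storing them
// needs no data type information.
final case class CatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None) {

  // Convert to the plan-side form once the column's data type is known.
  def toPlanStat(dataType: DataType): ColumnStat = ColumnStat(
    distinctCount = distinctCount,
    min = min.map(CatalogColumnStatProps.parse(_, dataType)),
    max = max.map(CatalogColumnStatProps.parse(_, dataType)),
    nullCount = nullCount)
}

// Hypothetical helper; the "<column>.<field>" key layout is an assumption.
object CatalogColumnStatProps {
  val KeyVersion = "version"

  // Turn a stored string into a typed value for the given data type.
  def parse(value: String, dataType: DataType): Any = dataType match {
    case IntType    => value.toInt
    case StringType => value
  }

  // Discover column names from the property keys alone, without a schema.
  def columnNames(props: Map[String, String]): Set[String] = {
    val suffix = "." + KeyVersion
    props.keySet.collect { case k if k.endsWith(suffix) => k.stripSuffix(suffix) }
  }

  // Read whichever fields are present; missing ones simply stay None.
  def fromProps(column: String, props: Map[String, String]): CatalogColumnStat =
    CatalogColumnStat(
      distinctCount = props.get(s"$column.distinctCount").map(BigInt(_)),
      min = props.get(s"$column.min"),
      max = props.get(s"$column.max"),
      nullCount = props.get(s"$column.nullCount").map(BigInt(_)))
}
```

In this sketch the catalog can load stats for any column that has a version key, and typed parsing is deferred until a plan asks for a {{ColumnStat}} with a known data type.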

The added flexibility will make it possible to have alternative implementations 
for stats, and will separate stats collection from stats and estimation 
processing in plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
