[
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-22201:
---------------------------------
Labels: bulk-closed (was: )
> Dataframe describe includes string columns
> ------------------------------------------
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: cold gin
> Priority: Minor
> Labels: bulk-closed
>
> According to the API documentation, the default no-arg DataFrame describe()
> function should include only numeric column types, but it includes String
> columns as well. This produces unusable statistics (for example, max returns
> "V8903" for one of the string columns in my dataset), and it also leads to
> stack traces when you run show() on the DataFrame returned from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 does not agree with what the default API documentation says:
> only Int, Double, etc. (numeric) columns should be included when generating
> the statistics.
> Perhaps this reveals the need for a new function to produce stats that make
> sense only for string columns, or else an additional parameter to describe()
> to filter in/out certain column types?
> In summary, the *default* describe() behavior (the no-arg call) should not
> include string columns. Note that boolean columns are correctly excluded by
> describe().
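
A possible workaround for the behavior described above, sketched under stated assumptions rather than taken from the issue: choose the numeric columns explicitly and pass them to describe(). The helper name numeric_columns and the NUMERIC_TYPES set are hypothetical (not part of Spark); the input is assumed to be a list of (name, type) pairs shaped like the one PySpark's DataFrame.dtypes returns.

```python
# Hypothetical helper, not part of Spark: choose numeric columns from
# (name, type) pairs shaped like PySpark's DataFrame.dtypes.

# Assumption: these are the simple type names Spark reports for numeric types.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def numeric_columns(dtypes):
    """Return the names of columns whose type is numeric, in input order."""
    selected = []
    for name, dtype in dtypes:
        # "decimal(10,2)" carries precision and scale; compare the base name only.
        base = dtype.split("(", 1)[0]
        if base in NUMERIC_TYPES:
            selected.append(name)
    return selected
```

With a PySpark DataFrame named df (an assumption here), the call would look like df.describe(*numeric_columns(df.dtypes)).show(), which limits the statistics to the selected columns whatever the no-arg default does.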