cold gin created SPARK-22201:
--------------------------------
Summary: Dataframe describe includes string columns
Key: SPARK-22201
URL: https://issues.apache.org/jira/browse/SPARK-22201
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: cold gin
As per the API documentation, the default no-arg DataFrame describe() function
should only include numerical column types, but it is including String types as
well. This creates unusable statistical results (for example, max returns
"V8903" for one of the string columns in my dataset).
There also appear to be several issues related to this:
https://issues.apache.org/jira/browse/SPARK-16468
https://issues.apache.org/jira/browse/SPARK-16429
But the behavior described in SPARK-16429 does not agree with what the API
documentation says for the default call: only numeric columns (Int, Double,
etc.) should be included when generating the statistics.
Perhaps this reveals the need for a new function to produce stats that make
sense only for string columns, or else an additional parameter to describe() to
filter in/out certain column types?
In summary, the *default* describe() behavior (the no-arg call) should not
include string columns.
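Until this is addressed, one possible workaround is to pass only the numeric
column names to describe() explicitly. The following is a minimal sketch in
PySpark terms, not a proposed patch. It assumes the type strings reported by
DataFrame.dtypes (such as "int", "double", "decimal(10,2)"); the names
numeric_columns and describe_numeric are hypothetical helpers, not part of
Spark.

```python
# Type names that DataFrame.dtypes is assumed to report for numeric columns.
# Decimal columns carry precision/scale, e.g. "decimal(10,2)", so they are
# matched by prefix instead of by exact name.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return names of numeric columns from (name, type) pairs as in df.dtypes."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


def describe_numeric(df):
    """Hypothetical helper: run describe() on the numeric columns only."""
    cols = numeric_columns(df.dtypes)
    if not cols:
        # describe() with no arguments would fall back to the default
        # behavior reported here (string columns included), so refuse
        # instead of silently doing that.
        raise ValueError("no numeric columns to describe")
    return df.describe(*cols)
```

With a DataFrame df, describe_numeric(df).show() would then print statistics
for the numeric columns only, leaving columns such as the string column from
my dataset out of the result.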