cold gin created SPARK-22201:
--------------------------------

             Summary: Dataframe describe includes string columns
                 Key: SPARK-22201
                 URL: https://issues.apache.org/jira/browse/SPARK-22201
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: cold gin


According to the API documentation, the default no-arg DataFrame describe() 
function should include only numeric columns, but it includes string columns as 
well. This produces unusable statistical results (for example, max returns 
"V8903" for one of the string columns in my dataset).

There also appear to be several issues related to this:

https://issues.apache.org/jira/browse/SPARK-16468

https://issues.apache.org/jira/browse/SPARK-16429

However, SPARK-16429 is inconsistent with what the API documentation says about 
the default behavior: only numeric columns (Int, Double, etc.) should be 
included when generating the statistics.

Perhaps this reveals the need for a new function to produce stats that make 
sense only for string columns, or else an additional parameter to describe() to 
filter in/out certain column types? 
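
To illustrate the kind of type filter I mean, here is a minimal sketch in 
Python. It assumes PySpark, where DataFrame.dtypes returns (name, type string) 
pairs and describe() accepts column names. The helper numeric_columns and the 
example columns are made up for illustration and are not part of Spark.

```python
# Hypothetical helper (not part of Spark): pick the numeric columns out of
# the (name, type string) pairs that PySpark's DataFrame.dtypes returns.

# Type strings Spark SQL uses for its fixed-width numeric types.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of columns whose Spark SQL type string is numeric."""
    return [
        name
        for name, dtype in dtypes
        # Decimal types carry precision and scale, e.g. "decimal(10,2)".
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


# Same shape as DataFrame.dtypes, with made-up columns for illustration.
example_dtypes = [
    ("id", "bigint"),
    ("code", "string"),
    ("price", "double"),
    ("qty", "int"),
    ("amount", "decimal(10,2)"),
]

print(numeric_columns(example_dtypes))

# With a real DataFrame `df` (requires a running Spark session), the
# numeric-only statistics could then be requested explicitly:
#   df.describe(*numeric_columns(df.dtypes)).show()
```

This only works around the problem on the caller's side; it does not change 
what the no-arg describe() does by default.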

In summary, the *default* (no-arg) behavior of describe() should not include 
string columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
