[jira] [Updated] (SPARK-22201) Dataframe describe includes string columns

Hyukjin Kwon (JIRA) Mon, 20 May 2019 21:20:43 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon updated SPARK-22201:
---------------------------------
    Labels: bulk-closed  (was: )

> Dataframe describe includes string columns
> ------------------------------------------
>
>                 Key: SPARK-22201
>                 URL: https://issues.apache.org/jira/browse/SPARK-22201
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: cold gin
>            Priority: Minor
>              Labels: bulk-closed
>
> According to the API documentation, the default no-arg DataFrame describe()
> function should include only numeric columns, but it includes string columns
> as well. This produces unusable statistics (for example, max returns "V8903"
> for one of the string columns in my dataset) and also leads to stack traces
> when you call show() on the DataFrame returned from describe().
> There also appear to be several related issues:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> However, SPARK-16429 is inconsistent with the documented default behavior:
> only numeric columns (Int, Double, etc.) should be included when generating
> the statistics.
> Perhaps this reveals the need for a new function that produces statistics
> meaningful for string columns, or else an additional parameter to describe()
> that includes or excludes certain column types?
> In summary, the *default* describe() behavior (the no-arg call) should not
> include string columns. Note that boolean columns are correctly excluded by
> describe().
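
One possible workaround for the behavior described above is to pick out the
numeric columns explicitly and pass them to describe(). The sketch below is
plain Python and assumes the (name, type) pairs have the shape PySpark returns
from df.dtypes, with type strings such as "int", "bigint", "double" or
"decimal(10,2)". The helper name numeric_columns and the sample schema are
hypothetical and do not come from the issue.

```python
# Type strings assumed to denote numeric columns in PySpark's df.dtypes output.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of columns whose type string is numeric.

    `dtypes` is a list of (column_name, type_string) pairs.
    Decimal types carry precision and scale, e.g. "decimal(10,2)",
    so they are matched by prefix rather than by exact name.
    """
    return [
        name
        for name, type_name in dtypes
        if type_name in NUMERIC_TYPES or type_name.startswith("decimal")
    ]


# Hypothetical schema, used only for illustration.
sample_dtypes = [
    ("id", "bigint"),
    ("code", "string"),
    ("price", "decimal(10,2)"),
    ("active", "boolean"),
]

print(numeric_columns(sample_dtypes))  # string and boolean columns are dropped

# With a real PySpark DataFrame `df`, the selection could be applied as:
#     df.describe(*numeric_columns(df.dtypes)).show()
```

Matching on exact type names, rather than on a prefix such as "int", avoids
accidentally treating unrelated types like "interval" as numeric.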



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

