[ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058522#comment-15058522 ]
Narine Kokhlikyan commented on SPARK-12325:
-------------------------------------------
Thank you for your generous kindness, [~srowen]. I appreciate it!
> Inappropriate error messages in DataFrame StatFunctions
> --------------------------------------------------------
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: Narine Kokhlikyan
> Priority: Critical
>
> Hi there,
> I mentioned this issue earlier in one of my pull requests for the SQL
> component, but I have not received feedback on any of them:
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll list the facts again:
> 1. I call the DataFrame correlation method and the error says that the
> covariance calculation is not supported. I do not think this is an
> appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The bigger issue is not the message itself but the design.
> A single class, CovarianceCounter, does the computations for both
> correlation and covariance. This may be convenient from a certain
> perspective, but it is harder to understand and to extend, especially
> if you want to add another algorithm, e.g. Spearman correlation.
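The behavior in point 1 is what one would expect when correlation and covariance share a validation step whose message is hardcoded. Below is a minimal sketch of that pattern, together with a variant that takes the name of the calling statistic. It is plain Scala with no Spark dependency; `checkNumeric`, `checkNumericFor`, and the `Map`-based schema are hypothetical names chosen for illustration, not the actual StatFunctions code.

```scala
object MessageSketch {
  // Hypothetical stand-in for a DataFrame schema: column name -> type name.
  val schema: Map[String, String] =
    Map("rating" -> "StringType", "income" -> "DoubleType")

  val numericTypes: Set[String] =
    Set("IntegerType", "LongType", "FloatType", "DoubleType")

  // Shared check with a hardcoded message: every caller reports "Covariance".
  def checkNumeric(cols: Seq[String]): Unit =
    cols.foreach { c =>
      require(numericTypes.contains(schema(c)),
        s"Covariance calculation for columns with dataType ${schema(c)} not supported.")
    }

  // Same check, but the caller supplies the name of the statistic.
  def checkNumericFor(functionName: String, cols: Seq[String]): Unit =
    cols.foreach { c =>
      require(numericTypes.contains(schema(c)),
        s"$functionName calculation for columns with dataType ${schema(c)} not supported.")
    }

  def main(args: Array[String]): Unit = {
    val cols = Seq("rating", "income")
    // Both checks fail on the string column; only the message differs.
    println(scala.util.Try(checkNumeric(cols)).failed.get.getMessage)
    println(scala.util.Try(checkNumericFor("Correlation", cols)).failed.get.getMessage)
  }
}
```

With the second helper, a correlation call on a string column would report "Correlation" rather than "Covariance", which is roughly what the first proposed solution amounts to.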
> There are many possible solutions here, for example:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the
> corresponding methods
> 3. creating a CorrelationCounter and splitting the computations for
> correlation and covariance
> and many more.
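One way the third option could look is sketched below, assuming the running sums are kept in one small accumulator and each statistic lives in its own object. All names here (`PairMoments`, `Statistic`, `SampleCovariance`, `PearsonCorrelation`) are hypothetical and the code is plain Scala, not the existing Spark implementation.

```scala
// Hypothetical sketch: names and structure are illustrative only.
// Running moments for a pair of columns, updated one row at a time.
final class PairMoments {
  var count = 0L
  var xAvg = 0.0
  var yAvg = 0.0
  var ck = 0.0  // co-moment: sum of (x - xAvg) * (y - yAvg)
  var mkX = 0.0 // sum of squared deviations of x
  var mkY = 0.0 // sum of squared deviations of y

  def add(x: Double, y: Double): this.type = {
    count += 1
    val dx = x - xAvg // deviation from the previous mean of x
    val dy = y - yAvg // deviation from the previous mean of y
    xAvg += dx / count
    yAvg += dy / count
    ck += dx * (y - yAvg)
    mkX += dx * (x - xAvg)
    mkY += dy * (y - yAvg)
    this
  }
}

// Each statistic carries its own name, so error messages can use it.
trait Statistic {
  def name: String
  def result(m: PairMoments): Double
}

object SampleCovariance extends Statistic {
  val name = "Covariance"
  // Assumes at least two rows have been added.
  def result(m: PairMoments): Double = m.ck / (m.count - 1)
}

object PearsonCorrelation extends Statistic {
  val name = "Correlation"
  // Assumes both columns have non-zero variance.
  def result(m: PairMoments): Double = m.ck / math.sqrt(m.mkX * m.mkY)
}

object CounterSketch {
  def main(args: Array[String]): Unit = {
    val m = new PairMoments
    Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)).foreach { case (x, y) => m.add(x, y) }
    Seq[Statistic](SampleCovariance, PearsonCorrelation).foreach { s =>
      println(s"${s.name}: ${s.result(m)}")
    }
  }
}
```

In this layout a rank-based statistic such as Spearman correlation, which needs information these moments do not carry, would get its own accumulator instead of another branch inside a shared class.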
> Since I'm not getting any response and, according to GitHub, all five of you
> have worked on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Could any of you please explain this behavior of the stat functions, or
> say more about the plans for them?
> If you are planning to remove it or change it in some other way, we'd truly
> appreciate it if you let us know.
> I would like to open a pull request for this, but since my pull
> requests in the SQL/ML components are sitting without any response,
> I'll wait for your reply first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine