Cheolsoo Park created SPARK-8908:
------------------------------------
Summary: Calling distinct() with parentheses throws error in Scala DataFrame
Key: SPARK-8908
URL: https://issues.apache.org/jira/browse/SPARK-8908
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0, 1.5.0
Reporter: Cheolsoo Park
Priority: Minor
To reproduce, call {{distinct()}} on a DataFrame in spark-shell. For example:
{code}
scala> sqlContext.table("my_table").distinct()
<console>:19: error: not enough arguments for method apply: (colName: String)org.apache.spark.sql.Column in class DataFrame.
Unspecified value parameter colName.
{code}
This is confusing because {{distinct}} in DataFrame is an alias of
{{dropDuplicates}}, and both {{dropDuplicates}} and {{dropDuplicates()}} work.
Here is a summary:
||Scala code||Works||
|DF.distinct|Y|
|DF.distinct()|N|
|DF.dropDuplicates|Y|
|DF.dropDuplicates()|Y|
Looking at the definition of {{distinct}}, it is declared without {{()}}:
{code}
override def distinct: DataFrame = dropDuplicates()
{code}
As a result, this is what seems to happen:
{code}
distinct()
=> dropDuplicates()()
=> DataFrame() // because dropDuplicates() returns DF
=> DataFrame.apply() // fails because apply() takes a column parameter
{code}
I can verify that adding {{()}} to the definition makes both {{distinct}} and
{{distinct()}} work.
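The behaviour described above can be sketched outside Spark with a small stand-in class. This is only an illustration under stated assumptions: {{Frame}}, {{distinctNoParens}} and {{distinctWithParens}} are hypothetical names, not Spark code, and {{apply}} here returns a String rather than a Column.

```scala
// Hypothetical stand-in for DataFrame; not Spark's actual class.
class Frame(val rows: Seq[String]) {
  // Plays the role of DataFrame.apply(colName: String): Column.
  def apply(colName: String): String = s"column($colName)"

  def dropDuplicates(): Frame = new Frame(rows.distinct)

  // Declared without a parameter list, like the reported definition.
  def distinctNoParens: Frame = dropDuplicates()

  // Declared with an empty parameter list, like the proposed fix.
  def distinctWithParens(): Frame = dropDuplicates()
}

object DistinctParensSketch {
  val f = new Frame(Seq("a", "a", "b"))

  // A parameterless method must be called without parentheses.
  val viaNoParens: Frame = f.distinctNoParens

  // A method with an empty parameter list accepts the explicit ().
  val viaParens: Frame = f.distinctWithParens()

  // Any argument list after a parameterless method is applied to its
  // result, so this is f.distinctNoParens.apply("a").
  val col: String = f.distinctNoParens("a")

  // f.distinctNoParens() does not compile for the same reason: it is
  // read as f.distinctNoParens.apply(), and apply needs a colName.

  def main(args: Array[String]): Unit = {
    println(viaNoParens.rows)
    println(viaParens.rows)
    println(col)
  }
}
```

In this sketch the only difference between the two methods is the empty parameter list in the declaration, which is the change suggested for {{distinct}}.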