[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ]
Mitesh edited comment on SPARK-17867 at 5/18/17 1:47 PM:
---------------------------------------------------------

I'm seeing a regression from this change: the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan]

{code:scala}
val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// The filter should not be pushed down to the local table scan.
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-17867
>                 URL: https://issues.apache.org/jira/browse/SPARK-17867
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>
> In Dataset.dropDuplicates we find and use only the first resolved attribute in the output with each given column name. When more than one column has the same name, the other same-named columns are put into the aggregation columns instead of the grouping columns. We should fix this.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
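The resolution bug quoted above can be modeled in plain Scala. This is an illustrative sketch only, not Spark's actual Analyzer or dropDuplicates code: `Attribute`, `splitColumns`, and `splitColumnsFixed` are hypothetical names, and the simplified logic just shows why resolving only the first attribute per name leaves same-named duplicates in the aggregation side.

```scala
// Hypothetical, simplified model of dropDuplicates column resolution.
// The `id` field stands in for Spark's expression ID, which is what
// distinguishes two columns that share a name.
case class Attribute(name: String, id: Int)

object DropDuplicatesModel {
  // Buggy behavior: for each requested name, only the FIRST matching
  // attribute becomes a grouping column; every remaining attribute
  // (including other columns with the same name) is aggregated instead.
  def splitColumns(
      output: Seq[Attribute],
      names: Seq[String]): (Seq[Attribute], Seq[Attribute]) = {
    val groupCols = names.flatMap(n => output.find(_.name == n))
    val aggCols = output.filterNot(groupCols.contains)
    (groupCols, aggCols)
  }

  // Fixed behavior: ALL attributes whose name matches a requested
  // name become grouping columns.
  def splitColumnsFixed(
      output: Seq[Attribute],
      names: Seq[String]): (Seq[Attribute], Seq[Attribute]) = {
    val groupCols = output.filter(a => names.contains(a.name))
    val aggCols = output.filterNot(groupCols.contains)
    (groupCols, aggCols)
  }

  def main(args: Array[String]): Unit = {
    // Two distinct columns that happen to share the name "a".
    val output = Seq(Attribute("a", 1), Attribute("a", 2), Attribute("b", 3))
    println(splitColumns(output, Seq("a"))._1.map(_.id))      // List(1)
    println(splitColumnsFixed(output, Seq("a"))._1.map(_.id)) // List(1, 2)
  }
}
```

Under the buggy resolution, the second `"a"` column falls into the aggregation side, so an arbitrary surviving value per group is kept for it rather than it participating in duplicate detection.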