[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ]
Mitesh edited comment on SPARK-17867 at 5/18/17 1:47 PM:
---------------------------------------------------------

I'm seeing a regression from this change: the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan]

{code:scala}
val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// The filter should not be pushed down to the local table scan.
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-17867
>                 URL: https://issues.apache.org/jira/browse/SPARK-17867
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>
> In Dataset.dropDuplicates we find and use only the first resolved attribute in the output with each given column name. When more than one column has the same name, the other same-named columns are put into the aggregation columns instead of the grouping columns. We should fix this.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
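The resolution bug quoted above can be modeled in plain Scala. This is an illustrative sketch only, not Spark's actual Analyzer or dropDuplicates code: `Attribute`, `splitColumns`, and `splitColumnsFixed` are hypothetical names, and the simplified logic just shows why resolving only the first attribute per name leaves same-named duplicates in the aggregation side.

```scala
// Hypothetical, simplified model of dropDuplicates column resolution.
// The `id` field stands in for Spark's expression ID, which is what
// distinguishes two columns that share a name.
case class Attribute(name: String, id: Int)

object DropDuplicatesModel {
  // Buggy behavior: for each requested name, only the FIRST matching
  // attribute becomes a grouping column; every remaining attribute
  // (including other columns with the same name) is aggregated instead.
  def splitColumns(
      output: Seq[Attribute],
      names: Seq[String]): (Seq[Attribute], Seq[Attribute]) = {
    val groupCols = names.flatMap(n => output.find(_.name == n))
    val aggCols = output.filterNot(groupCols.contains)
    (groupCols, aggCols)
  }

  // Fixed behavior: ALL attributes whose name matches a requested
  // name become grouping columns.
  def splitColumnsFixed(
      output: Seq[Attribute],
      names: Seq[String]): (Seq[Attribute], Seq[Attribute]) = {
    val groupCols = output.filter(a => names.contains(a.name))
    val aggCols = output.filterNot(groupCols.contains)
    (groupCols, aggCols)
  }

  def main(args: Array[String]): Unit = {
    // Two distinct columns that happen to share the name "a".
    val output = Seq(Attribute("a", 1), Attribute("a", 2), Attribute("b", 3))
    println(splitColumns(output, Seq("a"))._1.map(_.id))      // List(1)
    println(splitColumnsFixed(output, Seq("a"))._1.map(_.id)) // List(1, 2)
  }
}
```

Under the buggy resolution, the second `"a"` column falls into the aggregation side, so an arbitrary surviving value per group is kept for it rather than it participating in duplicate detection.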