GitHub user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/10388#discussion_r48199539
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala ---
@@ -194,4 +194,45 @@ class DataFrameNaFunctionsSuite extends QueryTest with SharedSQLContext {
     assert(out1(4) === Row("Amy", null, null))
     assert(out1(5) === Row(null, null, null))
   }
+
+  test("Spark-12231: dropna with partitionBy and groupBy") {
+    withTempPath { dir =>
+      val df = sqlContext.range(10)
+      val df1 = df.withColumn("a", $"id".cast("int"))
+      df1.write.partitionBy("id").parquet(dir.getCanonicalPath)
+      val df2 = sqlContext.read.parquet(dir.getCanonicalPath)
+      val group = df2.na.drop().groupBy().count().collect()
--- End diff --
Instead of just seeing the stack trace, you get this:
```
[info] - Spark-12231: dropna with partitionBy and groupBy *** FAILED *** (1 second, 773 milliseconds)
[info] Exception thrown while executing query:
[info] == Parsed Logical Plan ==
[info] Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#72L]
[info] +- Filter AtLeastNNulls(n, a#70,id#71)
[info]    +- Relation[a#70,id#71] ParquetRelation
[info]
[info] == Analyzed Logical Plan ==
[info] count: bigint
[info] Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#72L]
[info] +- Filter AtLeastNNulls(n, a#70,id#71)
[info]    +- Relation[a#70,id#71] ParquetRelation
[info]
[info] == Optimized Logical Plan ==
[info] Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#72L]
[info] +- Project
[info]    +- Filter AtLeastNNulls(n, a#70,id#71)
[info]       +- Relation[a#70,id#71] ParquetRelation
[info]
[info] == Physical Plan ==
[info] TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#72L])
[info] +- TungstenExchange SinglePartition, None
[info]    +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#75L])
[info]       +- !Filter AtLeastNNulls(n, a#70,id#71)
[info]          +- Scan ParquetRelation[] InputPaths: file:/Users/marmbrus/workspace/spark/target/tmp/spark-28f5676f-6232-440f-8753-60f6e1aacc26
[info] == Exception ==
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 25, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: a#70
...
```
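For reference, a minimal sketch of how the test body could route the assertion through QueryTest's `checkAnswer` to produce the output above on failure (the suite already mixes in QueryTest per the diff header; the expected `Row(10L)` is my assumption, since `range(10)` yields no nulls and so `na.drop()` keeps all ten rows):
```
test("Spark-12231: dropna with partitionBy and groupBy") {
  withTempPath { dir =>
    val df1 = sqlContext.range(10).withColumn("a", $"id".cast("int"))
    df1.write.partitionBy("id").parquet(dir.getCanonicalPath)
    val df2 = sqlContext.read.parquet(dir.getCanonicalPath)
    // checkAnswer executes the query and, on failure, prints the parsed,
    // analyzed, optimized, and physical plans along with the exception,
    // instead of a bare stack trace from collect() + assert.
    checkAnswer(df2.na.drop().groupBy().count(), Row(10L))
  }
}
```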