[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213361#comment-15213361 ]
koert kuipers commented on SPARK-13531:
---------------------------------------

Before the optimization, both MapPartitions have output type IntegerType. After EliminateSerialization, the output type of the first MapPartitions is ObjectType(interface org.apache.spark.sql.Row), while for the second it is still IntegerType, as it should be. What is not clear to me is why CanBroadcast.unapply is apparently applied to the first MapPartitions instead of the second; I am not familiar enough with the code to say.

> Some DataFrame joins stopped working with UnsupportedOperationException: No
> size estimation available for objects
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13531
>                 URL: https://issues.apache.org/jira/browse/SPARK-13531
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: koert kuipers
>            Priority: Minor
>
> This is using spark 2.0.0-SNAPSHOT.
>
> Dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions <function1>, obj#135: object, [if (input[0, object].isNullAt) null else input[0, object].get AS x#128]
> +- MapPartitions <function1>, createexternalrow(if (isnull(x#9)) null else x#9), [input[0, object] AS obj#135]
>    +- WholeStageCodegen
>       :  +- Project [_1#8 AS x#9]
>       :     +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
>
> Dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions <function1>, createexternalrow(x#2, if (isnull(y#3)) null else y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, object].get, true) AS y#131]
> +- WholeStageCodegen
>    :  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>    :     +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
>
> I run:
> df1.join(df2, Seq("x")).show
> and I get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation available for objects.
>   at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
>   at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
>   at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
>   at org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
>
> Not sure what changed; this ran about a week ago without issues (in our
> internal unit tests). It is fully reproducible. However, when I tried to
> minimize the issue I could not reproduce it by just creating data frames in
> the REPL with the same contents, so it probably has something to do with the
> way these are created (from Row objects and StructTypes).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
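To illustrate the failure mode in the stack trace: CanBroadcast.unapply asks the plan for its statistics, UnaryNode.statistics estimates the output row size by summing defaultSize over the output attribute types, and ObjectType.defaultSize throws because an opaque JVM object has no size estimate. The following is a minimal pure-Scala sketch of that interaction; the names mirror the ones in the trace, but these are simplified stand-ins, not Spark's actual classes.

```scala
// Simplified stand-ins for Spark SQL types; NOT the real Spark classes.
sealed trait DataType { def defaultSize: Int }

case object IntegerType extends DataType {
  val defaultSize = 4 // fixed-width type: size is known
}

// Mirrors ObjectType.defaultSize (ObjectType.scala:41 in the trace):
// no sensible size estimate exists for an arbitrary deserialized object.
case class ObjectType(cls: Class[_]) extends DataType {
  def defaultSize: Int =
    throw new UnsupportedOperationException(
      "No size estimation available for objects.")
}

// Sketch of what UnaryNode.statistics does: sum defaultSize over the
// output types. CanBroadcast.unapply calls this to compare the plan's
// size against the broadcast threshold, so a plan whose output contains
// an ObjectType (as here, after EliminateSerialization rewrites the
// first MapPartitions) throws before any join strategy is chosen.
def estimatedRowSize(output: Seq[DataType]): Long =
  output.map(_.defaultSize.toLong).sum
```

Under this reading, the broadcast-join eligibility check itself is what fails, which matches CanBroadcast.unapply sitting at the bottom of the trace.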