[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Allman updated SPARK-18676:
-----------------------------------
    Description: 
Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
modified the way Spark SQL estimates the output data size of query plans. I've 
found that, combined with the new partition pruning support for metastore tables 
in 2.1, this has led in some cases to underestimation of join child size 
statistics severe enough that such queries cannot be executed without disabling 
automatic broadcast conversion.
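
For reference, disabling automatic broadcast conversion amounts to setting 
{{spark.sql.autoBroadcastJoinThreshold}} to -1. A minimal sketch, assuming a 
{{SparkSession}} named {{spark}}:

{code:scala}
// Disable automatic broadcast joins so the planner does not act on the
// (under)estimated plan size and falls back to a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

// Equivalent SQL form:
// SET spark.sql.autoBroadcastJoinThreshold=-1
{code}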

In one case we debugged, the query planner estimated the size of a join child 
to be 3,854 bytes. Executing that child query, Spark reads 20 million rows 
(1 GB of data) from Parquet files, shuffles 722.9 MB of data, and outputs 17 
million rows. Based on the 3,854-byte estimate, Spark converts the child to a 
{{BroadcastExchange}} when planning the original join. The query fails at 
execution time unless automatic broadcast conversion is disabled.
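
For anyone trying to confirm the same mismatch, here is a rough sketch of how 
the planner's estimate can be inspected. The table and column names are 
hypothetical stand-ins, and {{optimizedPlan.statistics}} is the accessor as of 
2.1 (later releases renamed it):

{code:scala}
// Hypothetical stand-in for the join child described above.
val childDF = spark.table("events").groupBy("user_id").count()

// The optimizer's size estimate that drives the broadcast decision (Spark 2.1).
println(childDF.queryExecution.optimizedPlan.statistics.sizeInBytes)

// The join's physical plan shows whether the child was planned as a
// BroadcastExchange.
spark.table("users").join(childDF, "user_id").explain()
{code}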

This particular query is complex and very specific to our data and schema, and 
I have not yet developed a reproducible test case that can be shared. I realize 
this does not give the Spark team much to work with, but I'm available to help. 
For now, the best suggestion I can offer is to run a join where one side is an 
aggregation selecting a few fields from a large table with a wide schema 
containing many string columns.
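
A hypothetical sketch of the shape of query I have in mind; the table and 
column names are made up, and any sufficiently wide, string-heavy table should 
do:

{code:scala}
import org.apache.spark.sql.functions.max

// A wide, string-heavy table aggregated down to a couple of columns. The gap
// between the child's wide schema and the narrow aggregation output is what
// pushes the size estimate down.
val wide = spark.table("wide_partitioned_table")
val agg  = wide.groupBy("id").agg(max("updated_at").as("updated_at"))

// Join the narrow aggregation against another table and inspect the plan;
// look for a BroadcastExchange on the aggregation side.
spark.table("other_table").join(agg, "id").explain()
{code}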

This issue also exists in Spark 2.0, but we never encountered it there because 
in that version it only manifests for partitioned relations read directly from 
the filesystem, a feature we rarely use. We hit it in 2.1 because 2.1 now 
performs partition pruning for metastore tables as well.

As a backstop, we've patched our branch of Spark 2.1 to revert the reductions 
in the default data type sizes for string, binary, and user-defined types. We 
also removed the {{statistics}} override in {{UnaryNode}}, which scales down a 
plan's output size estimate by the ratio of its output schema size to its 
child's. We have not had this problem since.
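
To illustrate why that combination produces such small estimates, here is a 
back-of-the-envelope version of the ratio heuristic described above. The 
numbers are made up, and this is not the actual Spark code, just the arithmetic:

{code:scala}
// Illustrative arithmetic only (not Spark source). Suppose partition pruning
// leaves a child relation estimated at ~100 KB, with a wide schema of 50 string
// columns at a small per-column default size, projected down to one long column.
val childSizeInBytes = BigInt(100L * 1024)  // hypothetical pruned-size estimate
val childRowSize     = 50 * 20              // 50 string columns, 20-byte default
val outputRowSize    = 8                    // a single long column
val estimate = childSizeInBytes * outputRowSize / childRowSize
println(estimate)  // ~819 bytes, far below any broadcast threshold, even if
                   // the real output is hundreds of megabytes
{code}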

  was:
Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
modified the way Spark SQL estimates the output data size of query plans. I've 
found that, combined with the new partition pruning support for metastore tables 
in 2.1, this has led in some cases to underestimation of join child size 
statistics severe enough that such queries cannot be executed without disabling 
automatic broadcast conversion.

In one case we debugged, the query planner estimated the size of a join child 
to be 3,854 bytes. Executing that child query, Spark reads 20 million rows 
(1 GB of data) from Parquet files, shuffles 722.9 MB of data, and outputs 17 
million rows. Based on the 3,854-byte estimate, Spark converts the child to a 
{{BroadcastExchange}} when planning the original join. The query fails at 
execution time unless automatic broadcast conversion is disabled.

This particular query is complex and very specific to our data and schema, and 
I have not yet developed a reproducible test case that can be shared. I realize 
this does not give the Spark team much to work with, but I'm available to help. 
For now, the best suggestion I can offer is to run a join where one side is an 
aggregation selecting a few fields from a large table with a wide schema 
containing many string columns.

This issue also exists in Spark 2.0, but we never encountered it there because 
in that version it only manifests for partitioned relations read directly from 
the filesystem, a feature we rarely use. We hit it in 2.1 because 2.1 now 
performs partition pruning for metastore tables as well.

As a backstop, we've patched our branch of Spark 2.1 to revert the reductions 
in the default data type sizes for string, binary, and user-defined types. We 
also removed the statistics override in `LogicalPlan`, which scales down a 
plan's output size estimate by the ratio of its output schema size to its 
children's. We have not had this problem since.


> Spark 2.x query plan data size estimation can crash join queries versus 1.x
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-18676
>                 URL: https://issues.apache.org/jira/browse/SPARK-18676
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>            Reporter: Michael Allman
>


