[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

Babulal (JIRA) Mon, 10 Sep 2018 10:24:15 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609534#comment-16609534
 ]


Babulal commented on SPARK-25332:
---------------------------------

Hi [~maropu] 

it seems to be a straightforward issue so raised directly. Actually issue 
happens  because Relation size is correct for same session but when restart the 
application ,HadoopFSRelation size became default relation size 
(spark.sql.defaultSizeInBytes which is LONG.MAXVALUE) that is why SortMergeJoin 
is choosen instead of broadcast join.

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25332
>                 URL: https://issues.apache.org/jira/browse/SPARK-25332
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Babulal
>            Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct<name:string,age:int>
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct<name:string,age:int>
>  
>  
> Now Restart Spark-Shell or do spark-submit orrestart JDBCServer  again and 
> run same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF0000}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct<name:string,age:int>
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct<name:string,age:int>
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> +----------------------------+--------------------------------------------------------------+-------+
> |col_name |data_type |comment|
> +----------------------------+--------------------------------------------------------------+-------+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> +----------------------------+--------------------------------------------------------------+-------+
>  
> With datasource table ,working fine ( create table using parquet instead of 
> stored by )



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

Reply via email to