Maciej Bryński created SPARK-11282:
--------------------------------------

             Summary: Very strange broadcast join behaviour
                 Key: SPARK-11282
                 URL: https://issues.apache.org/jira/browse/SPARK-11282
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.5.1
            Reporter: Maciej Bryński
            Priority: Critical


Hi,
I found some very strange broadcast join behaviour.

Following this Jira, https://issues.apache.org/jira/browse/SPARK-10577,
I'm using the hint for broadcast joins. (I patched 1.5.1 with
https://github.com/apache/spark/pull/8801/files.)

I found that whether this feature works depends on the executor memory.
In my case the broadcast join only works correctly up to 31G of executor memory.

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Note that with 32G of executor memory the columns from the broadcast side (id2, val2) come back as None instead of the matching values.

Please find the example code attached.
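The attachment is not inline in this message; below is a minimal sketch of what debug_broadcast_join.py likely does, based on the transcript above. The table names, sizes, and column expressions are assumptions; only the broadcast() hint (added to the Python API by the PR referenced above) and the printed messages are taken from the report.

```python
# Hypothetical reconstruction of debug_broadcast_join.py (the real attachment
# is not inline here; names and table sizes are assumptions).
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import broadcast  # hint added by PR #8801

sc = SparkContext(appName="debug_broadcast_join")
sqlContext = SQLContext(sc)

# The command line passes "true" to enable the broadcast hint.
use_broadcast = len(sys.argv) > 1 and sys.argv[1] == "true"

print("Creating test tables...")
big = sqlContext.range(0, 1000000).selectExpr("id", "id AS val")
small = sqlContext.range(0, 100).selectExpr("id AS id2", "id AS val2")

print("Joining tables...")
right = broadcast(small) if use_broadcast else small
joined = big.join(right, big.id == right.id2, "left_outer")

print("Joined table schema:")
joined.printSchema()

print("Selecting data for id = 5...")
# Expected: [Row(id=5, val=5, id2=5, val2=5)]; with 32G executor memory the
# reported bug yields id2=None, val2=None instead.
print(joined.where(joined.id == 5).collect())
```

Run via spark-submit as shown in the transcript, e.g. `spark-submit --executor-memory 31G debug_broadcast_join.py true`.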



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
