[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226969#comment-15226969 ]

Herman van Hovell commented on SPARK-14389:
-------------------------------------------

[~Steve Johnston] If I follow your code correctly, you do the following:
1. Read the table from CSV
2. Convert it
3. Repartition it to one node
4. Perform a cartesian self join (which is planned as a BroadcastNestedLoopJoin, BNL).

This will probably be executed on a single node (which is bad) because of the 
repartitioning (we would need the physical plan to be sure). Is the 
repartitioning there for the sake of the example, or is it part of the 
testing regime? What strikes me as odd is that the cartesian product of the 
lineitem dataset with itself yields only about 3.6 million records, which is 
not that big.
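A quick order-of-magnitude check of that figure (the input row count below is inferred from the 3.6-million estimate in this comment, not read from the attached data):

```python
# If a cartesian self join yields ~3.6 million rows, the input table has
# roughly sqrt(3.6e6) rows. The exact lineitem row count is not stated in
# the comment, so this is only a back-of-the-envelope estimate.
import math

cartesian_rows = 3_600_000
input_rows = math.isqrt(cartesian_rows)
print(input_rows, input_rows * input_rows)
```

A few thousand input rows producing a few million output rows is small by Spark standards, which is why the OOM is surprising.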

Could you post the physical plan ({{joined_dataframe.explain(true)}})?

> OOM during BroadcastNestedLoopJoin
> ----------------------------------
>
>                 Key: SPARK-14389
>                 URL: https://issues.apache.org/jira/browse/SPARK-14389
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>            Reporter: Steve Johnston
>         Attachments: lineitem.tbl, sample_script.py, stdout.txt
>
>
> When executing the attached sample_script.py in client mode with a single 
> executor, a "java.lang.OutOfMemoryError: Java heap space" exception occurs 
> during the self join of a small table (TPC-H lineitem generated for a 1M 
> dataset). See also the attached execution log, stdout.txt.



