[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709624#comment-14709624 ]

Imran Rashid commented on SPARK-9850:
-------------------------------------

I know the 1000 partitions used in the design doc is just an example, but I 
wanted to point out that the number probably needs to be much larger.  With 
the 2 GB limit per partition, 1000 partitions already caps you at 2 TB.  My 
rule of thumb has been to keep partitions around 100 MB, which is roughly in 
line with the 64 MB you mention in the doc, and that brings you down to 
100 GB.  And given that you want to handle skewed data etc., you probably 
want to leave quite a bit of headroom, which limits you to relatively small 
datasets.
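
To spell out that arithmetic as a quick sketch (the numbers are the ones 
above; the names are just for illustration, not Spark constants):

{code:scala}
// Back-of-the-envelope totals for the numbers above.
val numPartitions = 1000L
val mb = 1024L * 1024
val gb = 1024L * mb

val hardCap     = numPartitions * 2 * gb    // 2 GB/partition  -> 2 TB total
val comfortable = numPartitions * 100 * mb  // 100 MB/partition -> ~100 GB total

println(s"hard cap: ${hardCap / gb} GB")          // 2000 GB, i.e. 2 TB
println(s"rule of thumb: ${comfortable / gb} GB") // ~97 GB, call it 100 GB
{code}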

The key point is that once you go over 2000 partitions, you are in 
{{HighlyCompressedMapStatus}} territory, which only tracks an average block 
size (plus a bitmap of empty blocks), so it will be relatively useless for 
this.  Perhaps each shuffle map task could always send an uncompressed map 
status back to the driver?  Or maybe you could use the 
{{HighlyCompressedMapStatus}} only on the shuffle-read side?  I'm not sure 
about the performance implications; I'm just throwing out an idea.
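
For reference, the 2000 cutoff is hard-coded in {{MapStatus.apply}}; a 
simplified sketch of the shape of that logic (from memory, not the exact 
source, and these are Spark-internal classes):

{code:scala}
// Simplified sketch of org.apache.spark.scheduler.MapStatus.apply
// (shape only; MapStatus and friends are private[spark]).
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    // Lossy: keeps the average size of non-empty blocks plus a bitmap
    // of the empty ones -- too coarse to drive adaptive planning.
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    // One log-encoded size byte per reduce partition.
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
{code}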

> Adaptive execution in Spark
> ---------------------------
>
>                 Key: SPARK-9850
>                 URL: https://issues.apache.org/jira/browse/SPARK-9850
>             Project: Spark
>          Issue Type: Epic
>          Components: Spark Core, SQL
>            Reporter: Matei Zaharia
>            Assignee: Yin Huai
>         Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the 
> current Spark engine requires the execution DAG for a job to be set in 
> advance. Even with cost-based optimization, it is hard to know the behavior 
> of data and user-defined functions well enough to always get great execution 
> plans. This JIRA proposes to add adaptive query execution, so that the engine 
> can change the plan for each query as it sees what data earlier stages 
> produced.
>
> We propose adding this to Spark SQL / DataFrames first, using a new API in 
> the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, 
> the functionality could be extended to other libraries or the RDD API, but 
> that is more difficult than adding it in SQL.
>
> I've attached a design doc by Yin Huai and myself explaining how it would 
> work in more detail.


