You should also take into account that Spark has different options for representing 
data in memory, such as Java serialized objects, Kryo serialized objects, and 
Tungsten (columnar, optionally compressed). The Tungsten representation depends 
heavily on the underlying data and its sort order, especially when compression is 
enabled.
Then there is also broadcasted data etc. to think about.
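For instance, which representation is used can be steered through configuration. A minimal sketch of the relevant settings for spark-defaults.conf (the values are illustrative starting points, not tuned recommendations):

```properties
# Use Kryo instead of Java serialization for shuffled and cached objects;
# Kryo is usually more compact, but classes may need to be registered.
spark.serializer                              org.apache.spark.serializer.KryoSerializer

# Compress the columnar (Tungsten) in-memory cache used by Spark SQL;
# effectiveness depends heavily on the data and its ordering.
spark.sql.inMemoryColumnarStorage.compressed  true

# Rows per columnar batch: larger batches compress better but hold
# more memory per batch while being built.
spark.sql.inMemoryColumnarStorage.batchSize   10000
```

Measuring the cached size of the same dataset under each setting (e.g. via the Storage tab of the Spark UI) is a quick way to see how much these representations differ for your data.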

As such, I am not aware of a specific guide, but there is also no magic behind 
it. Could be a good JIRA task :) 

> On 22 Sep 2016, at 08:36, Hemant Bhanawat <> wrote:
> I am working on profiling TPCH queries for Spark 2.0. I see a lot of temporary 
> object creation (sometimes as large as the data size itself), which is justified 
> for the kind of processing Spark does. But, from a production perspective, is 
> there a guideline on how much memory should be allocated for processing a 
> specific data size of, let's say, Parquet data? Also, has someone investigated 
> memory usage for the individual SQL operators like Filter, GroupBy, OrderBy, 
> Exchange etc.? 
> Hemant Bhanawat