Memory usage for spark types

assaf.mendelson Sun, 18 Sep 2016 09:06:46 -0700

Hi,
I am trying to understand how spark types are kept in memory and accessed.
I tried to look at the code at the definition of MapType and ArrayType for 
example and I can't seem to find the relevant code for its actual 
implementation.

I am trying to figure out how these two types are implemented to understand how
they match my needs.
In general, it appears the size of a map is the same as two arrays which is
about double the naïve array implementation: if I have 1000 rows, each with a
map from 10K integers to 10K integers, I find through caching the dataframe
that the total is is ~150MB (the naïve implementation of two arrays would code
1000*10000*(4+4) or a total of ~80MB). I see the same size if I use two arrays.
Second, what would be the performance of updating the map/arrays as they are
immutable (i.e. some copying is required).

The reason I am asking this is because I wanted to do an aggregate function
which calculates a variation of a histogram.
The most naïve solution for this would be to have a map from the bin to the
count. But since we are talking about an immutable map, wouldn't that cost a
lot more?
An even further optimization would be to use a mutable array where we combine
the key and value to a single value (key and value are both int in my case).
Assuming the maximum number of bins is small (e.g. less than 10), it is often
cheaper to just search the array for the right key (and in this case the size
of the data is expected to be significantly smaller than map). In my case, most
of the type (90%) there are less than 3 elements in the bin and If I have more
than 10 bins I basically do a combination to reduce the number.

For few elements, a map becomes very inefficient - If I create 10M rows with 1
map from int to int each I get an overall of ~380MB meaning ~38 bytes per
element (instead of just 8). For array, again it is too large (229MB, i.e. ~23
bytes per element).

Is there a way to implement a simple mutable array type to use in the
aggregation buffer? Where is the portion of the code that handles the actual
type handling?
Thanks,
Assaf.

--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-usage-for-spark-types-tp18984.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Memory usage for spark types

Reply via email to