I am trying to understand how spark types are kept in memory and accessed.
I tried to look at the code at the definition of MapType and ArrayType for 
example and I can't seem to find the relevant code for its actual 

I am trying to figure out how these two types are implemented to understand how 
they match my needs.
In general, it appears the size of a map is the same as two arrays which is 
about double the naïve array implementation: if I have 1000 rows, each with a 
map from 10K integers to 10K integers, I find through caching the dataframe 
that the total is is ~150MB (the naïve implementation of two arrays would code 
1000*10000*(4+4) or a total of ~80MB). I see the same size if I use two arrays. 
Second, what would be the performance of updating the map/arrays as they are 
immutable (i.e. some copying is required).

The reason I am asking this is because I wanted to do an aggregate function 
which calculates a variation of a histogram.
The most naïve solution for this would be to have a map from the bin to the 
count. But since we are talking about an immutable map, wouldn't that cost a 
lot more?
An even further optimization would be to use a mutable array where we combine 
the key and value to a single value (key and value are both int in my case). 
Assuming the maximum number of bins is small (e.g. less than 10), it is often 
cheaper to just search the array for the right key (and in this case the size 
of the data is expected to be significantly smaller than map). In my case, most 
of the type (90%) there are less than 3 elements in the bin and If I have more 
than 10 bins I basically do a combination to reduce the number.

For few elements, a map becomes very inefficient  - If I create 10M rows with 1 
map from int to int each I get an overall of ~380MB meaning ~38 bytes per 
element (instead of just 8). For array, again it is too large (229MB, i.e. ~23 
bytes per element).

Is there a way to implement a simple mutable array type to use in the 
aggregation buffer? Where is the portion of the code that handles the actual 
type handling?

View this message in context: 
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Reply via email to