Take a look at UnsafeArrayData and UnsafeMapData.
On Sun, Sep 18, 2016 at 9:06 AM, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
> I am trying to understand how spark types are kept in memory and accessed.
> I tried to look at the code at the definition of MapType and ArrayType, for
> example, and I can’t seem to find the relevant code for their actual
> in-memory implementation.
> I am trying to figure out how these two types are implemented to
> understand how they match my needs.
> In general, it appears the size of a map is the same as two arrays, which
> is about double the naïve array implementation: if I have 1000 rows, each
> with a map from 10K integers to 10K integers, I find through caching the
> dataframe that the total is ~150MB (the naïve implementation of two
> arrays would cost 1000*10000*(4+4) bytes, or a total of ~80MB). I see the
> same size if I use two arrays. Second, what would be the performance of
> updating the map/arrays, given that they are immutable (i.e. some copying
> is required)?
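For what it's worth, the ~150MB figure is roughly consistent with a word-aligned, Tungsten-style layout. Below is a back-of-the-envelope sketch in Java; the assumed layout (an 8-byte length header, a null bitmap padded to 8-byte words, one 8-byte word per element, and a map stored as two parallel arrays) is an illustration only, not the exact UnsafeArrayData/UnsafeMapData layout of any particular Spark version:

```java
// Sketch (not Spark's actual code): estimate the size of an
// UnsafeArrayData-style buffer under an assumed layout of
// [8-byte element count][null bitmap, word-padded][8-byte word per element].
public class LayoutSketch {
    // round n up to the next multiple of 8 (word alignment)
    static long roundUp(long n) { return (n + 7) / 8 * 8; }

    // assumed array layout: length header + null bitmap + element words
    static long unsafeArrayBytes(long numElements) {
        return 8L + roundUp((numElements + 7) / 8) + 8L * numElements;
    }

    // assumed map layout: two parallel arrays (keys and values)
    static long unsafeMapBytes(long numEntries) {
        return 2 * unsafeArrayBytes(numEntries);
    }

    public static void main(String[] args) {
        // 1000 rows, each a map of 10K int -> 10K int, as in the question
        long perRow = unsafeMapBytes(10_000);
        System.out.println("per-row map bytes ~ " + perRow);
        System.out.println("total for 1000 rows ~ "
                + perRow * 1000 / (1024 * 1024) + " MB");
    }
}
```

Under these assumptions each 10K-entry map costs ~162KB, so 1000 rows land around 155MiB, in the same ballpark as the ~150MB observed: most of the gap versus the naïve 80MB estimate is simply ints occupying 8-byte slots instead of 4 bytes.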
> The reason I am asking this is because I wanted to do an aggregate
> function which calculates a variation of a histogram.
> The most naïve solution for this would be to have a map from the bin to
> the count. But since we are talking about an immutable map, wouldn’t that
> cost a lot more?
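On the copying concern: with an immutable map buffer, each increment conceptually copies the whole map, so building a histogram of n increments costs O(n·size) instead of O(n). A toy sketch of the two update styles with plain Java collections (nothing Spark-specific, and `HistogramUpdate` is just an illustrative name):

```java
import java.util.HashMap;
import java.util.Map;

public class HistogramUpdate {
    // Immutable-style update: copy the whole map, then modify the copy.
    // Each increment is O(map size) -- the cost the question worries about
    // when the aggregation buffer cannot be mutated in place.
    static Map<Integer, Integer> updatedCopy(Map<Integer, Integer> hist,
                                             int bin) {
        Map<Integer, Integer> next = new HashMap<>(hist);
        next.merge(bin, 1, Integer::sum);
        return next;
    }

    // Mutable update: O(1) amortized per increment.
    static void updateInPlace(Map<Integer, Integer> hist, int bin) {
        hist.merge(bin, 1, Integer::sum);
    }
}
```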
> An even further optimization would be to use a mutable array where we
> combine the key and value into a single value (key and value are both int
> in my case). Assuming the maximum number of bins is small (e.g. fewer than
> 10), it is often cheaper to just search the array for the right key (and
> in this case the size of the data is expected to be significantly smaller
> than a map). In my case, most of the time (90%) there are fewer than 3
> elements in the bin, and if I have more than 10 bins I basically do a
> combination to reduce the number.
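The packed-array idea described above can be sketched as follows; `PackedHistogram` is a hypothetical name, and this is a plain-JVM illustration rather than something pluggable into a Spark aggregation buffer as-is:

```java
import java.util.Arrays;

// Sketch of the packed-array idea: store each (bin, count) pair as one
// long (key in the high 32 bits, value in the low 32 bits) and use
// linear search, which is cheap when there are fewer than ~10 bins.
public class PackedHistogram {
    private long[] slots = new long[0];  // each slot = key<<32 | value

    static long pack(int key, int value) {
        return ((long) key << 32) | (value & 0xFFFFFFFFL);
    }
    static int keyOf(long slot)   { return (int) (slot >>> 32); }
    static int valueOf(long slot) { return (int) slot; }

    // add `delta` to bin `key`, appending a new slot if the bin is new
    public void add(int key, int delta) {
        for (int i = 0; i < slots.length; i++) {
            if (keyOf(slots[i]) == key) {
                slots[i] = pack(key, valueOf(slots[i]) + delta);
                return;
            }
        }
        slots = Arrays.copyOf(slots, slots.length + 1);
        slots[slots.length - 1] = pack(key, delta);
    }

    public int count(int key) {
        for (long slot : slots)
            if (keyOf(slot) == key) return valueOf(slot);
        return 0;
    }

    public int bins() { return slots.length; }
}
```

This keeps the data at 8 bytes per occupied bin plus array overhead, which is the "significantly smaller than a map" case when the bin count stays small.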
> For few elements, a map becomes very inefficient: if I create 10M rows
> with 1 map from int to int each, I get an overall of ~380MB, meaning ~38
> bytes per element (instead of just 8). For an array, again it is too large
> (229MB, i.e. ~23 bytes per element).
> Is there a way to implement a simple mutable array type to use in the
> aggregation buffer? Where is the portion of the code that handles the
> actual type handling?
> View this message in context: Memory usage for spark types
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/>.