Do note that SharedArrays only work with arrays of bits types such as Int
or Float64, not with, for example, an array of Strings.

Assuming that you are parsing these files and generating a large number of
small strings, I suspect that most of the time is being spent serializing
and deserializing all of those small strings.

Just as an example, serializing and then deserializing 2 million small
strings with an average length of 6 bytes takes around 4 seconds on my
machine, while a single 12MB string takes 0.03 seconds.
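
A rough sketch of how a comparison like this could be timed. It assumes a
recent Julia (1.x), where serialize and deserialize live in the
Serialization standard library, and it round-trips through an in-memory
IOBuffer rather than between processes. The helper name roundtrip and the
way the test strings are generated are my own choices for illustration, and
the timings will differ by machine and Julia version.

```julia
using Serialization

# Round-trip a value through an in-memory buffer, as a stand-in for what
# happens when a worker sends a result back to the master process.
function roundtrip(x)
    io = IOBuffer()
    serialize(io, x)
    seekstart(io)
    return deserialize(io)
end

n = 2_000_000
# 2 million strings of exactly 6 bytes each (six-digit numbers).
small = [string(rand(100_000:999_999)) for _ in 1:n]
# The same data as one 12 MB string.
big = join(small)

# Warm up so that compilation time is not included in the measurements.
roundtrip(["a"])
roundtrip("a")

@time roundtrip(small)   # many small strings
@time roundtrip(big)     # one large string
```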

I haven't tried it, but it may be faster to return a single large delimited
string (one line for every key-value pair, with your own delimiter between
key and value) and then build your dictionary on the master process. Overall
you will be generating only half as many small strings.
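
Something along these lines is what I have in mind. This is an untested-at-
scale sketch for Julia 1.x, assuming string keys and string values; the
function names pack and unpack and the tab/newline delimiters are
hypothetical, so pick delimiters that cannot occur in your data.

```julia
# Hypothetical delimiters: tab between key and value, newline between pairs.
const KV_SEP = '\t'
const REC_SEP = '\n'

# On the worker: flatten the parsed key-value pairs into one delimited string.
function pack(kvs)
    io = IOBuffer()
    for (k, v) in kvs
        print(io, k, KV_SEP, v, REC_SEP)
    end
    return String(take!(io))
end

# On the master: rebuild the dictionary from the delimited string.
function unpack(s::AbstractString)
    d = Dict{String,String}()
    for line in eachline(IOBuffer(s))
        isempty(line) && continue
        # limit=2 keeps any further separators inside the value.
        k, v = split(line, KV_SEP; limit=2)
        d[k] = v
    end
    return d
end

# Small round trip to show the intended use.
blob = pack(["alpha" => "1", "beta" => "2"])
dict = unpack(blob)
```

With pmap, each worker would return the result of pack for its file, and the
master would call unpack on every returned string.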




On Sat, May 31, 2014 at 12:15 PM, Kuba Roth <[email protected]> wrote:

> Hi Amit,
> Well, in my case I'm parsing a bunch of files and storing the results in
> dictionaries, which are merged back into one big array of dictionaries.
> Since each file can be parsed independently, pmap seems to be a good and
> clean fit. But because the size of each Dictionary is quite big, merging
> the data back is super slow.
> Perhaps pmap is not the best answer to this problem and I should look
> further into shared arrays (which unfortunately I haven't had time for yet)
> kuba
>
