Dear Jayneel, 

The workload is set up as a typical use of the Hadoop Map-Reduce framework, 
which means that each map task runs as a separate process. A map task can 
itself be multithreaded, but we use Mahout's version of the classification 
algorithm, which is single-threaded. 

So there is no multithreaded version in the release, but we may consider 
implementing one. That's a good idea.
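
As a rough illustration of what a multithreaded map task inside a single 
process could look like, here is a minimal Java sketch. It is not Mahout's or 
Hadoop's code and not part of the benchmark: the class and method names are 
hypothetical, and taking the length of a record stands in for the real 
classification step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of the benchmark, Hadoop, or Mahout.
class ThreadedMapSketch {

    // Stand-in for the per-record work (for example, classifying one document).
    static int processRecord(String record) {
        return record.length();
    }

    // Processes all records inside one process using a fixed pool of threads.
    // Results are returned in the same order as the input records.
    static List<Integer> mapInParallel(List<String> records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String r : records) {
                Callable<Integer> task = () -> processRecord(r);
                futures.add(pool.submit(task));
            }
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures) {
                out.add(f.get()); // blocks until that record is done
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `ThreadedMapSketch.mapInParallel(Arrays.asList("a", "bb"), 2)` 
processes both records on a two-thread pool. Hadoop also ships a 
`MultithreadedMapper` class that runs a thread-safe mapper on several threads; 
whether it suits this workload is an open question here.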

Regarding the data sets: the workload is tuned to use ~1.5GB of memory per 
core, which puts a lot of pressure on last-level caches of up to 20-30MB. You 
can increase the memory requirement by using a larger model (not a larger data 
set). I can give you links to download larger training sets. That could be 
useful for benchmarking on real hardware. On the other hand, if you are using 
the benchmark within a simulation framework, it might become too slow at 
higher core counts.

Regards,
Djordje


________________________________________
From: Jayneel Gandhi [[email protected]]
Sent: Wednesday, July 11, 2012 8:22 PM
To: [email protected]
Subject: Data Analytics: multithreaded vs multiprocess

Hi,

I was able to get the Data Analytics workload to work. The steps on the web 
page tune the workload as a multiprocess workload on a multicore machine. I 
wanted to know if there is a way to change the config to run it as a single 
process but run map reduce with multiple threads in that process.

Also, I wanted to know if you have other algorithms or data sets in data 
analytics that are more memory intensive.

Thanks,
Jayneel
