Dear Jayneel,

The workload is set up as a typical use of the Hadoop MapReduce framework, which means that each map task runs as a separate process. A map task can itself be multithreaded, but we use Mahout's version of the classification algorithm, which is single-threaded.
So, there is no multithreaded version in the release, but we may consider implementing one. That's a good idea.

Regarding the data sets: the workload is tuned to use ~1.5GB of memory per core, putting a lot of pressure on last-level caches up to 20-30MB. You can increase the memory requirement by using a larger model (not the data set). I can give you other links to download larger training sets. That could be useful for benchmarking on real hardware. On the other hand, if you are using the benchmark within a simulation framework, it might become too slow with higher core counts.

Regards,
Djordje

________________________________________
From: Jayneel Gandhi [[email protected]]
Sent: Wednesday, July 11, 2012 8:22 PM
To: [email protected]
Subject: Data Analytics: multithreaded vs multiprocess

Hi,

I was able to get the Data Analytics workload to work. The steps on the web page tune the workload as a multiprocess workload on a multicore machine. I wanted to know if there is a way to change the config to run it as a single process but run map reduce with multiple threads in that process.

Also, I wanted to know if you have some other algorithms or data sets in data analytics that can be more memory intensive?

Thanks,
Jayneel
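
For readers of this thread, the single-process, multithreaded arrangement asked about above can be sketched roughly as follows. This is only a toy illustration in plain Python, not Hadoop or Mahout code: `map_split` and `run_multithreaded` are hypothetical names, and a simple word count stands in for the classification step. It shows the structure only (several map tasks sharing one process and one address space); in standard CPython, threads would not speed up CPU-bound work like this.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_split(split):
    """Map step for one input split.

    Hypothetical stand-in for the classifier: counts the words in the split.
    """
    counts = Counter()
    for line in split:
        counts.update(line.split())
    return counts


def run_multithreaded(splits, num_threads=4):
    """Run every map task as a thread inside this one process, then reduce."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each split is handled by a worker thread instead of a separate process.
        partials = list(pool.map(map_split, splits))

    # Reduce step: merge the per-split counts into a single result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    splits = [["a b a"], ["b c"], ["a c c"]]
    print(run_multithreaded(splits, num_threads=2))
```

In the multiprocess setup described in the reply, each call to `map_split` would instead run in its own process with its own copy of the model, which is why the memory footprint there scales per core.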
