1.5GB should be enough to run the benchmark (i.e., the Mahout "testclassifier" program) correctly with the given inputs. However, I do not remember whether it is enough for the model creation (trainclassifier). You can try a couple of sizes with a single map job to find the minimum heap size you need such that the job doesn't crash.
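For concreteness, that single-map experiment could be expressed in mapred-site.xml roughly as follows (a sketch: the slot-maximum property name is a Hadoop 1.x-era assumption, and 1.5GB is only a starting value to probe from):

    <!-- Run one map task at a time and probe -Xmx downward until
         trainclassifier stops crashing; Hadoop 1.x property names assumed. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1536m</value>  <!-- try 1.5GB first, then adjust -->
    </property>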
In general, if your machine does not have enough DRAM, the solution is not to reduce the heap size per process, but to reduce the number of processes. This might leave your machine underutilized, but that's not a problem for the model creation. The actual benchmark is only the last step, which your machine should handle well (4 cores and 8GB of DRAM seem OK). As for the reducers, their job in this benchmark is much simpler than the mappers', so you can assume that they don't need much heap.

Regards,
Djordje

________________________________________
From: Jinchun Kim [[email protected]]
Sent: Monday, March 25, 2013 12:56 AM
To: Djordje Jevdjic
Cc: [email protected]
Subject: Re: Question about data analytic

Thanks, Djordje.

The heap size indicated in mapred-site.xml is set to -Xmx2048M, and my machine has 8GB of DRAM. Based on your reply to Fu (http://www.mail-archive.com/[email protected]/msg00019.html), I'm using 4 mappers and 2 reducers, so I guess my machine is not able to run the benchmark with a 2GB heap size. In that reply, you said:

    number of maps = number of cores you want to run this on
    number of reduce jobs = 1, unless the number of mappers is > 8
    amount of memory = number of mappers * heap size

Thus, running 4 mappers will require 8GB of heap in total, which is not available on my machine because the OS and other processes also need memory. I'm going to reduce the heap size to 1.5GB and try again. What I'm wondering is whether the reducers also use heap... If so, I need to decrease the number of reducers as well as the heap size. Does each reducer require its own heap?

On Sun, Mar 24, 2013 at 6:05 PM, Djordje Jevdjic <[email protected]> wrote:

Dear Jinchun,

A timeout of 1200 sec is already too generous; increasing it will not solve the problem. I cannot see your logs, but yes, the problem again seems to be the indicated heap size and the DRAM capacity your machine has.

Regards,
Djordje

________________________________________
From: Jinchun Kim [[email protected]]
Sent: Friday, March 22, 2013 3:04 PM
To: Djordje Jevdjic
Cc: [email protected]
Subject: Re: Question about data analytic

Thanks, Djordje :)

I was able to prepare the input data file, and now I'm trying to create category-based splits of the Wikipedia dataset (41GB) and the training data set (5GB) using Mahout. I had no problem with the training data set, but Hadoop showed the following messages when I tried to do the same job with the Wikipedia dataset:

.........
13/03/21 22:31:00 INFO mapred.JobClient:  map 27% reduce 1%
13/03/21 22:40:31 INFO mapred.JobClient:  map 27% reduce 2%
13/03/21 22:58:49 INFO mapred.JobClient:  map 27% reduce 3%
13/03/21 23:22:57 INFO mapred.JobClient:  map 27% reduce 4%
13/03/21 23:46:32 INFO mapred.JobClient:  map 27% reduce 5%
13/03/22 00:27:14 INFO mapred.JobClient:  map 27% reduce 6%
13/03/22 01:06:55 INFO mapred.JobClient:  map 27% reduce 7%
13/03/22 01:14:06 INFO mapred.JobClient:  map 27% reduce 3%
13/03/22 01:15:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_r_000000_1, Status : FAILED
Task attempt_201303211339_0002_r_000000_1 failed to report status for 1200 seconds. Killing!
13/03/22 01:20:09 INFO mapred.JobClient:  map 27% reduce 4%
13/03/22 01:33:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_000037_1, Status : FAILED
Task attempt_201303211339_0002_m_000037_1 failed to report status for 1228 seconds. Killing!
13/03/22 01:35:12 INFO mapred.JobClient:  map 27% reduce 5%
13/03/22 01:40:38 INFO mapred.JobClient:  map 27% reduce 6%
13/03/22 01:52:28 INFO mapred.JobClient:  map 27% reduce 7%
13/03/22 02:16:27 INFO mapred.JobClient:  map 27% reduce 8%
13/03/22 02:19:02 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_000018_1, Status : FAILED
Task attempt_201303211339_0002_m_000018_1 failed to report status for 1204 seconds. Killing!
13/03/22 02:49:03 INFO mapred.JobClient:  map 27% reduce 9%
13/03/22 02:52:04 INFO mapred.JobClient:  map 28% reduce 9%
........

The reduce progress falls back to an earlier point, and the job eventually stops at map 46%, reduce 2% without completing. Is this also related to the heap size and DRAM capacity? I was wondering whether increasing the timeout would help.

On Fri, Mar 22, 2013 at 8:46 AM, Djordje Jevdjic <[email protected]> wrote:

Dear Jinchun,

The warning message that you get is irrelevant. The problem seems to be the amount of memory that is given to the map-reduce tasks. You need to increase the heap size (e.g., -Xmx2048M) and make sure that you have enough DRAM for the heap size you indicate. To change the heap size, edit the following file:

    $HADOOP_HOME/conf/mapred-site.xml

and specify the heap size by adding/changing the following parameter:

    mapred.child.java.opts

If your machine doesn't have enough DRAM, the whole process of preparing the data and the model is indeed expected to take a couple of hours.

Regards,
Djordje
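Concretely, the edit described above amounts to something like the following in $HADOOP_HOME/conf/mapred-site.xml (a sketch; mapred.task.timeout, the parameter behind the "failed to report status for 1200 seconds" kills in the log, lives in the same file and defaults to 600000 ms):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048m</value>
    </property>
    <!-- Kill threshold for unresponsive tasks, in milliseconds; the log
         above suggests it was raised to about 1200000 ms (20 min) here. -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1200000</value>
    </property>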
________________________________________
From: Jinchun Kim [[email protected]]
Sent: Friday, March 22, 2013 1:14 PM
To: [email protected]
Subject: Question about data analytic

Hi, all.

I'm trying to run the Data Analytics benchmark on my x86 Ubuntu machine. I found that when I divided the 30GB Wikipedia input data into small chunks of 64MB, CPU usage was really low (checked with the /usr/bin/time command): most of the execution time was spent idle or waiting, and user CPU time was only 13% of the total running time. Is it because I'm running the benchmark on a single node? Or does it have something to do with the following warning message?

WARN driver.MahoutDriver: No wikipediaXMLSplitter.props found on classpath, will use command-line arguments only

I don't understand why the user CPU time is so low while it takes 2.5 hours to finish splitting the Wikipedia inputs.

Thanks!

--
Jinchun Kim
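Taken together, the sizing rule quoted earlier in the thread (maps = cores, one reducer, memory = mappers * heap size), applied to the 4-core/8GB machine discussed above, might translate into mapred-site.xml roughly as follows (a sketch with Hadoop 1.x-era property names; the heap itself would be set via mapred.child.java.opts as in the snippets above):

    <!-- 4 concurrent maps x 1.5GB heap = 6GB, leaving ~2GB of DRAM
         for the OS and the Hadoop daemons on an 8GB machine. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>1</value>
    </property>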
