Thanks Djordje :) I was able to prepare the input data file, and now I'm trying to create category-based splits of the Wikipedia dataset (41 GB) and the training dataset (5 GB) using Mahout.
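For context, the commands I'm running look roughly like this (the driver names and flags follow the old Mahout Wikipedia example and may differ between Mahout versions; the file paths are placeholders for my setup):

    # split the raw Wikipedia XML dump into 64 MB chunks
    $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
        -d enwiki-pages-articles.xml -o wikipedia/chunks -c 64

    # create the category-based splits from the chunks
    $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator \
        -i wikipedia/chunks -o wikipedia/input -c categories.txt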
I had no problem with the training data set, but Hadoop showed the following messages when I tried to do the same job with the Wikipedia dataset:

.........
13/03/21 22:31:00 INFO mapred.JobClient: map 27% reduce 1%
13/03/21 22:40:31 INFO mapred.JobClient: map 27% reduce 2%
13/03/21 22:58:49 INFO mapred.JobClient: map 27% reduce 3%
13/03/21 23:22:57 INFO mapred.JobClient: map 27% reduce 4%
13/03/21 23:46:32 INFO mapred.JobClient: map 27% reduce 5%
13/03/22 00:27:14 INFO mapred.JobClient: map 27% reduce 6%
13/03/22 01:06:55 INFO mapred.JobClient: map 27% reduce 7%
13/03/22 01:14:06 INFO mapred.JobClient: map 27% reduce 3%
13/03/22 01:15:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_r_000000_1, Status : FAILED
Task attempt_201303211339_0002_r_000000_1 failed to report status for 1200 seconds. Killing!
13/03/22 01:20:09 INFO mapred.JobClient: map 27% reduce 4%
13/03/22 01:33:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_000037_1, Status : FAILED
Task attempt_201303211339_0002_m_000037_1 failed to report status for 1228 seconds. Killing!
13/03/22 01:35:12 INFO mapred.JobClient: map 27% reduce 5%
13/03/22 01:40:38 INFO mapred.JobClient: map 27% reduce 6%
13/03/22 01:52:28 INFO mapred.JobClient: map 27% reduce 7%
13/03/22 02:16:27 INFO mapred.JobClient: map 27% reduce 8%
13/03/22 02:19:02 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_000018_1, Status : FAILED
Task attempt_201303211339_0002_m_000018_1 failed to report status for 1204 seconds. Killing!
13/03/22 02:49:03 INFO mapred.JobClient: map 27% reduce 9%
13/03/22 02:52:04 INFO mapred.JobClient: map 28% reduce 9%
........

The reduce progress keeps falling back to an earlier point, and the job finally stops at map 46%, reduce 2% without completing. Is this also related to the heap and DRAM size? I was also wondering whether increasing the task timeout would help (a sketch of the mapred-site.xml settings I have in mind follows the quoted thread below).

On Fri, Mar 22, 2013 at 8:46 AM, Djordje Jevdjic <[email protected]> wrote:

> Dear Jinchun,
>
> The warning message that you get is irrelevant. The problem seems to be the
> amount of memory that is given to the map-reduce tasks. You need to
> increase the heap size (e.g., run with -Xmx2048M) and make sure that you have
> enough DRAM for the heap size you indicate. To change the heap size, edit
> the following file
> $HADOOP_HOME/conf/mapred-site.xml
> and specify the heap size by adding/changing the following parameter
> mapred.child.java.opts
>
> If your machine doesn't have enough DRAM, the whole process of preparing
> the data and the model is indeed expected to take a couple of hours.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Friday, March 22, 2013 1:14 PM
> To: [email protected]
> Subject: Question about data analytic
>
> Hi, All.
>
> I'm trying to run Data Analytics on my x86 Ubuntu machine.
> I found that when I divided the 30 GB Wikipedia input data into small chunks of
> 64 MB, CPU usage was really low.
> I checked this with the /usr/bin/time command.
> Most of the execution time was spent idle or waiting;
> user CPU time was only 13% of the total running time.
>
> Is it because I'm running Data Analytics on a single node?
> Or does it have something to do with the following warning message?
>
> WARN driver.MahoutDriver: No wikipediaXMLSplitter.props found on classpath,
> will use command-line arguments only
>
> I don't understand why the user CPU time is so low when it takes 2.5 hours to
> finish splitting the Wikipedia inputs.
> Thanks!
>
> --
> Jinchun Kim

--
*Jinchun Kim*
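The configuration sketch referenced above: roughly what I plan to add to $HADOOP_HOME/conf/mapred-site.xml, assuming Hadoop 1.x property names (mapred.child.java.opts for the heap size Djordje suggested, plus mapred.task.timeout for the task timeout). The values, a 2 GB heap and a 30-minute timeout, are only guesses for my setup:

    <property>
      <!-- per-task JVM options; sets the heap for each map/reduce task -->
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048m</value>
    </property>

    <property>
      <!-- milliseconds a task may go without reporting progress
           before it is killed ("failed to report status ... Killing!") -->
      <name>mapred.task.timeout</name>
      <value>1800000</value>
    </property>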
