Thanks Djordje :)
I was able to prepare the input data file, and now I'm trying to create
category-based splits of
the Wikipedia dataset (41 GB) and the training dataset (5 GB) using Mahout.

I had no problem with the training dataset, but Hadoop showed the following
messages
when I tried to do the same job with the Wikipedia dataset:

.........
13/03/21 22:31:00 INFO mapred.JobClient:  map 27% reduce 1%
13/03/21 22:40:31 INFO mapred.JobClient:  map 27% reduce 2%
13/03/21 22:58:49 INFO mapred.JobClient:  map 27% reduce 3%
13/03/21 23:22:57 INFO mapred.JobClient:  map 27% reduce 4%
13/03/21 23:46:32 INFO mapred.JobClient:  map 27% reduce 5%
13/03/22 00:27:14 INFO mapred.JobClient:  map 27% reduce 6%
13/03/22 01:06:55 INFO mapred.JobClient:  map 27% reduce 7%
13/03/22 01:14:06 INFO mapred.JobClient:  map 27% reduce 3%
13/03/22 01:15:35 INFO mapred.JobClient: Task Id :
attempt_201303211339_0002_r_000000_1, Status : FAILED
Task attempt_201303211339_0002_r_000000_1 failed to report status for 1200
seconds. Killing!
13/03/22 01:20:09 INFO mapred.JobClient:  map 27% reduce 4%
13/03/22 01:33:35 INFO mapred.JobClient: Task Id :
attempt_201303211339_0002_m_000037_1, Status : FAILED
Task attempt_201303211339_0002_m_000037_1 failed to report status for 1228
seconds. Killing!
13/03/22 01:35:12 INFO mapred.JobClient:  map 27% reduce 5%
13/03/22 01:40:38 INFO mapred.JobClient:  map 27% reduce 6%
13/03/22 01:52:28 INFO mapred.JobClient:  map 27% reduce 7%
13/03/22 02:16:27 INFO mapred.JobClient:  map 27% reduce 8%
13/03/22 02:19:02 INFO mapred.JobClient: Task Id :
attempt_201303211339_0002_m_000018_1, Status : FAILED
Task attempt_201303211339_0002_m_000018_1 failed to report status for 1204
seconds. Killing!
13/03/22 02:49:03 INFO mapred.JobClient:  map 27% reduce 9%
13/03/22 02:52:04 INFO mapred.JobClient:  map 28% reduce 9%
........

The reduce progress keeps falling back to an earlier point, and the job
eventually stops at map 46%, reduce 2% without completing.
Is this also related to the heap and DRAM size?
I was also wondering whether increasing the task timeout (the tasks are being
killed for not reporting status) would help.
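For reference, this is roughly what I'm planning to try in
$HADOOP_HOME/conf/mapred-site.xml, raising both the child heap and the task
timeout; the exact values (2 GB heap, 30-minute timeout) are just guesses on
my part:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <!-- milliseconds; 1800000 = 30 minutes, chosen arbitrarily -->
    <value>1800000</value>
  </property>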


On Fri, Mar 22, 2013 at 8:46 AM, Djordje Jevdjic <[email protected]> wrote:

> Dear Jinchun,
>
> The warning message that you get is irrelevant. The problem seems to be in
> the amount of memory that is given to the map-reduce tasks. You need to
> increase the heap size (e.g., -Xmx2048M) and make sure that you have
> enough DRAM for the heap size you indicate. To change the heap size, edit
> the following file
> $HADOOP_HOME/conf/mapred-site.xml
> and specify the heap size by adding/changing the following parameter
> mapred.child.java.opts
>
> If your machine doesn't have enough DRAM, the whole process of preparing
> the data and the model is indeed expected to take a couple of hours.
>
> Regards,
> Djordje
> ________________________________________
> From: Jinchun Kim [[email protected]]
> Sent: Friday, March 22, 2013 1:14 PM
> To: [email protected]
> Subject: Question about data analytic
>
> Hi, All.
>
> I'm trying to run Data Analytics on my x86 Ubuntu machine.
> I found that when I divided the 30 GB Wikipedia input data into small
> 64 MB chunks, CPU usage was really low.
> I checked this with the /usr/bin/time command.
> Most of the execution time was spent idle or waiting;
> user CPU time was only 13% of the total running time.
>
> Is it because I'm running Data Analytics on a single node?
> Or does it have something to do with the following warning message?
>
> WARN driver.MahoutDriver: No wikipediaXMLSplitter.props found on classpath,
> will use command-line arguments only
>
> I don't understand why the user CPU time is so low when it takes 2.5 hours
> to finish splitting the Wikipedia inputs.
> Thanks!
>
> --
> Jinchun Kim
>



-- 
*Jinchun Kim*
