1.5GB should be enough to run the benchmark (i.e., the Mahout "testclassifier" program) correctly with the given inputs. However, I do not remember whether it is enough for the model creation (trainclassifier). You can try a couple of sizes with a single map job to find the minimum heap size you need such that the job doesn't crash.
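For concreteness, that single-map experiment could be expressed in mapred-site.xml roughly as follows (a sketch: the slot-maximum property name is a Hadoop 1.x-era assumption, and 1.5GB is only a starting value to probe from):

    <!-- Run one map task at a time and probe -Xmx downward until
         trainclassifier stops crashing; Hadoop 1.x property names assumed. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1536m</value>  <!-- try 1.5GB first, then adjust -->
    </property>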
In general, if your machine does not have enough DRAM, the solution is not to reduce the heap size per process, but to reduce the number of processes. This might leave your machine underutilized, but that's not a problem for the model creation. The actual benchmark is only the last step, which your machine should handle well (4 cores and 8GB of DRAM seem OK). As for the reducers, their job in this benchmark is much simpler than the mappers', so you can assume that they don't need much heap.

Regards,
Djordje

________________________________________
From: Jinchun Kim [[email protected]]
Sent: Monday, March 25, 2013 12:56 AM
To: Djordje Jevdjic
Cc: [email protected]
Subject: Re: Question about data analytic

Thanks, Djordje.

The heap size indicated in mapred-site.xml is set to -Xmx2048M, and my machine has 8GB of DRAM. Based on your reply to Fu (http://www.mail-archive.com/[email protected]/msg00019.html), I'm using 4 mappers and 2 reducers, so I guess my machine is not able to run the benchmark with a 2GB heap size. In that reply, you said:

    number of maps = number of cores you want to run this on
    number of reduce jobs = 1, unless the number of mappers is > 8
    amount of memory = number of mappers * heap size

Thus, running 4 mappers will require 8GB of heap in total, which is not available on my machine because the OS and other processes also need memory. I'm going to reduce the heap size to 1.5GB and try again. What I'm wondering is whether the reducers also use heap... If so, I need to decrease the number of reducers as well as the heap size. Does each reducer require its own heap?

On Sun, Mar 24, 2013 at 6:05 PM, Djordje Jevdjic <[email protected]> wrote:

Dear Jinchun,

A timeout of 1200 sec is already too generous; increasing it will not solve the problem. I cannot see your logs, but yes, the problem again seems to be the indicated heap size and the DRAM capacity your machine has.

Regards,
Djordje

________________________________________
From: Jinchun Kim [[email protected]]
Sent: Friday, March 22, 2013 3:04 PM
To: Djordje Jevdjic
Cc: [email protected]
Subject: Re: Question about data analytic

Thanks, Djordje :)

I was able to prepare the input data file, and now I'm trying to create category-based splits of the Wikipedia dataset (41GB) and the training data set (5GB) using Mahout. I had no problem with the training data set, but Hadoop showed the following messages when I tried to do the same job with the Wikipedia dataset:

.........
13/03/21 22:31:00 INFO mapred.JobClient:  map 27% reduce 1%
13/03/21 22:40:31 INFO mapred.JobClient:  map 27% reduce 2%
13/03/21 22:58:49 INFO mapred.JobClient:  map 27% reduce 3%
13/03/21 23:22:57 INFO mapred.JobClient:  map 27% reduce 4%
13/03/21 23:46:32 INFO mapred.JobClient:  map 27% reduce 5%
13/03/22 00:27:14 INFO mapred.JobClient:  map 27% reduce 6%
13/03/22 01:06:55 INFO mapred.JobClient:  map 27% reduce 7%
13/03/22 01:14:06 INFO mapred.JobClient:  map 27% reduce 3%
13/03/22 01:15:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_r_000000_1, Status : FAILED
Task attempt_201303211339_0002_r_000000_1 failed to report status for 1200 seconds. Killing!
13/03/22 01:20:09 INFO mapred.JobClient:  map 27% reduce 4%
13/03/22 01:33:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_000037_1, Status : FAILED
Task attempt_201303211339_0002_m_000037_1 failed to report status for 1228 seconds. Killing!
13/03/22 01:35:12 INFO mapred.JobClient:  map 27% reduce 5%
13/03/22 01:40:38 INFO mapred.JobClient:  map 27% reduce 6%
13/03/22 01:52:28 INFO mapred.JobClient:  map 27% reduce 7%
13/03/22 02:16:27 INFO mapred.JobClient:  map 27% reduce 8%
13/03/22 02:19:02 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_000018_1, Status : FAILED
Task attempt_201303211339_0002_m_000018_1 failed to report status for 1204 seconds. Killing!
13/03/22 02:49:03 INFO mapred.JobClient:  map 27% reduce 9%
13/03/22 02:52:04 INFO mapred.JobClient:  map 28% reduce 9%
........

The reduce progress falls back to an earlier point, and the job eventually stops at map 46%, reduce 2% without completing. Is this also related to the heap size and DRAM capacity? I was wondering whether increasing the timeout would help.

On Fri, Mar 22, 2013 at 8:46 AM, Djordje Jevdjic <[email protected]> wrote:

Dear Jinchun,

The warning message that you get is irrelevant. The problem seems to be the amount of memory that is given to the map-reduce tasks. You need to increase the heap size (e.g., -Xmx2048M) and make sure that you have enough DRAM for the heap size you indicate. To change the heap size, edit the following file:

    $HADOOP_HOME/conf/mapred-site.xml

and specify the heap size by adding/changing the following parameter:

    mapred.child.java.opts

If your machine doesn't have enough DRAM, the whole process of preparing the data and the model is indeed expected to take a couple of hours.

Regards,
Djordje
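Concretely, the edit described above amounts to something like the following in $HADOOP_HOME/conf/mapred-site.xml (a sketch; mapred.task.timeout, the parameter behind the "failed to report status for 1200 seconds" kills in the log, lives in the same file and defaults to 600000 ms):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048m</value>
    </property>
    <!-- Kill threshold for unresponsive tasks, in milliseconds; the log
         above suggests it was raised to about 1200000 ms (20 min) here. -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1200000</value>
    </property>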
________________________________________
From: Jinchun Kim [[email protected]]
Sent: Friday, March 22, 2013 1:14 PM
To: [email protected]
Subject: Question about data analytic

Hi, all.

I'm trying to run the Data Analytics benchmark on my x86 Ubuntu machine. I found that when I divided the 30GB Wikipedia input data into small chunks of 64MB, CPU usage was really low (checked with the /usr/bin/time command): most of the execution time was spent idle or waiting, and user CPU time was only 13% of the total running time. Is it because I'm running the benchmark on a single node? Or does it have something to do with the following warning message?

WARN driver.MahoutDriver: No wikipediaXMLSplitter.props found on classpath, will use command-line arguments only

I don't understand why the user CPU time is so low while it takes 2.5 hours to finish splitting the Wikipedia inputs.

Thanks!

--
Jinchun Kim
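Taken together, the sizing rule quoted earlier in the thread (maps = cores, one reducer, memory = mappers * heap size), applied to the 4-core/8GB machine discussed above, might translate into mapred-site.xml roughly as follows (a sketch with Hadoop 1.x-era property names; the heap itself would be set via mapred.child.java.opts as in the snippets above):

    <!-- 4 concurrent maps x 1.5GB heap = 6GB, leaving ~2GB of DRAM
         for the OS and the Hadoop daemons on an 8GB machine. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>1</value>
    </property>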
