Vijay Murthi wrote:
Are you running the current trunk? My guess is that you are. If so,
then this error is "normal" and things should keep running.
I am using hadoop-0.2.0. I believe this is the current trunk.
No, that's a release. The trunk is what's currently in Subversion.
I used to
think that a child task exiting with "Out of memory" is normal, since the
task can be re-executed on another machine and still finish, whereas the
TaskTracker that manages it should not die. After this message I see only
one TaskTracker running on each node, at 99% CPU all the time, and no
reduce tasks running.
It sounds like these "Out of memory" errors are fatal.
On "mapred" local directory I see it writing to directory of name
"*_r_*". Since every output map task produce is on local disk can't it
just read those reduce files Map task create?
The local map output files are mostly needed by reduces running on other
nodes and must first be transferred.
I am running on a 64-bit kernel with a 32-bit JVM. The Java heap size is
set to a maximum of 1 GB for both the TaskTracker and the child processes.
I believe the TaskTracker and each child process run in their own 1 GB JVM
(correct me if I am wrong). Should each child process have less memory
than the TaskTracker, or should the total memory of the child processes it
manages be less than the TaskTracker's heap, since the TaskTracker creates
the children? In my case I am setting 500 MB of sort memory for each child
reduce process, so 3 reduce tasks * 500 MB can be more than 1 GB and cause
"Out of memory"?
Why are you using 500MB of sort memory with a 1GB heap if it keeps
causing problems? I would suggest either decreasing the sort memory or
increasing the heap size. Better yet, start with the defaults and
change one parameter at a time.
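For illustration, something like the following sets the two values
consistently from the job configuration. This is only a sketch: the property
names (mapred.child.java.opts, io.sort.mb) are the ones used in later Hadoop
releases, so check what your version actually calls them.

    import org.apache.hadoop.mapred.JobConf;

    public class MemoryTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf(MemoryTuning.class);

        // Heap for each child (map/reduce) JVM. The TaskTracker's own heap
        // is configured separately (e.g. HADOOP_HEAPSIZE in
        // conf/hadoop-env.sh) and is not shared with the children.
        conf.set("mapred.child.java.opts", "-Xmx1024m");

        // In-memory sort buffer per task, in MB. Keep it well below the
        // child heap: 500MB against a 1GB heap leaves little headroom.
        conf.setInt("io.sort.mb", 100);

        // ... set input/output paths and formats, then submit with
        // JobClient.runJob(conf).
      }
    }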
4MB buffers for file streams seems large to me.
I keep a 4 MB buffer because each map task reads a roughly 2 GB gzipped
text file. I thought this would make reading more efficient, and 4 MB * 3
map tasks per node is only about 12 MB. I am not sure why that is a lot.
Again, changing one setting at a time will allow you to better figure
out what improves things and what causes problems. This parameter is
used for lots of files, more than just your input data, so increasing it
to 4MB causes lots of 4MB buffers to be created. I have a hard time ever
seeing a justification for buffers larger than 1MB, as even 100k should
usually be enough for transfer time to dominate seek time, but, since map
and reduce both operate sequentially, even 100k should not be required for
good performance.
So you could even use a sort factor of 500. That would
make sorts a lot faster.
OK, I will try that. I have around 120 reduce files in total, each around
1 GB, for 6 reduce processes.
Please first try things with the defaults. Then try increasing the sort
factor to find if that improves things for you.
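If it helps to see where those knobs live, here is a rough sketch; again the
property names (io.sort.factor, io.file.buffer.size) are the ones I know from
later releases, so treat them as illustrative rather than exact.

    import org.apache.hadoop.mapred.JobConf;

    public class MergeTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf(MergeTuning.class);

        // How many segments are merged at once when sorting map output.
        // A larger factor means fewer merge passes over the data.
        conf.setInt("io.sort.factor", 100);

        // Per-stream buffer size in bytes. This buffer is allocated for
        // many files, not just the gzipped inputs, so something around
        // 64KB is usually plenty; 4MB multiplies quickly.
        conf.setInt("io.file.buffer.size", 64 * 1024);
      }
    }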
Also, why are you setting the task timeout so high? Do you have mappers or
reducers that take a long time per entry and are not calling
Reporter.setStatus() regularly? That can cause tasks to time out.
Yes. Map tasks sometimes take a long time and get killed. I have a reporter
that sets the status when the record reader is created. Still, things only
show up on the web page after the task exits with a Succeeded or Failed
status.
If processing a single record could take longer than the task timeout
(10 minutes) then you should call setStatus() during the processing of
the record to avoid timeouts. That's a better way to fix this than to
increase the task timeout. Note that setStatus() is efficient: don't
worry about calling it too often.
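For example, something along these lines breaks the per-record work into
pieces and reports between them. It is written against the old, non-generic
org.apache.hadoop.mapred API of that era (newer releases use generified
interfaces), and processPiece() is just a placeholder for your own work:

    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowRecordMapper extends MapReduceBase implements Mapper {

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
          throws IOException {
        int pieces = 100;  // split the expensive per-record work into pieces
        for (int i = 0; i < pieces; i++) {
          processPiece(value, i);  // placeholder for the real computation
          // Cheap call: tells the TaskTracker the task is still alive and
          // resets its timeout clock.
          reporter.setStatus("record " + key + ": piece " + (i + 1) + "/" + pieces);
        }
        output.collect(key, value);
      }

      private void processPiece(Writable value, int piece) {
        // the long-running work on one piece of the record goes here
      }
    }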
Doug