The issue is fixed in branch 0.21 through http://issues.apache.org/jira/browse/MAPREDUCE-927. The attempt directories are now moved inside the job directory, so the userlogs directory will contain only job directories.

Thanks
Amareshwari

On 6/16/10 12:47 PM, "Johannes Zillmann" <[email protected]> wrote:

Hi Edward,

i copied the userlogs folder which caused the error.
Two things speak against the too-many-files theory:
a) i can add new files to this folder (touch userlogsOLD/a, etc... )
b) the sysctl fs.file-max shows 817874 whereas the file count on the first level of userlogsOLD is 31999 and all files recursively are 107400.

Any thoughts ?
Johannes

On Jun 14, 2010, at 7:47 PM, Edward Capriolo wrote:

> On Mon, Jun 14, 2010 at 1:15 PM, Johannes Zillmann <[email protected]> wrote:
>
>> Hi,
>>
>> i am running a 4-node cluster with hadoop-0.20.2. Now i suddenly run into
>> a situation where every task scheduled on 2 of the 4 nodes failed.
>> Seems like the child jvm crashes. There are no child logs under
>> logs/userlogs. Tasktracker gives this:
>>
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In
>> JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
>> 2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM
>> Runner jvm_201006091425_0049_m_-946174604 spawned.
>> 2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM :
>> jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
>> 2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner:
>> attempt_201006091425_0049_m_003179_0 Child Error
>> java.io.IOException: Task process exit with nonzero status of 1.
>> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>
>>
>> At some point i simply renamed logs/userlogs to logs/userlogsOLD. A new job
>> created the logs/userlogs again and no error occurred anymore on this host.
>> The permissions of userlogs and userlogsOLD are exactly the same.
>> userlogsOLD contains about 378M in 132747 files. When copying the content of
>> userlogsOLD into userlogs, the tasks on that node start failing
>> again.
>>
>> Some questions:
>> - this seems to me like a problem with too many files in one folder - any
>> thoughts on this ?
>> - is the content of logs/userlogs cleaned up by hadoop regularly ?
>> - the logs/stdout files of the tasks do not exist, and the logs/out files of
>> the tasktracker don't have any specific message (other than the message posted
>> above) - is there any log file left where an error message could be found ?
>>
>>
>> best regards
>> Johannes
>
>
> Most file systems have an upper limit on the number of subfiles/folders in a
> folder. You have probably hit the EXT3 limit. If you launch lots and lots of
> jobs you can hit the limit before any cleanup happens.
>
> You can experiment with cleanup and other filesystems. The following log
> related issue might be relevant.
>
> https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
>
> Regards,
> Edward
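
A rough way to check the ext3 theory from this thread is to count only the immediate subdirectories of the userlogs directory, since that is what ext3 limits. The sketch below is not part of Hadoop; the function names and the margin parameter are hypothetical, and the value 31998 is an assumption based on ext3's cap of 32000 hard links per inode (each subdirectory adds one link to its parent, and the parent already has two). Regular files do not add links to the parent, which would explain why "touch userlogsOLD/a" still works, and fs.file-max limits open file handles system-wide rather than directory entries.

```python
import os

# Assumption: ext3 caps an inode's hard-link count at 32000, which leaves
# room for at most 31998 subdirectories in a single directory.
EXT3_MAX_SUBDIRS = 31998


def count_subdirs(path):
    """Return the number of immediate subdirectories of path.

    Symlinks are not followed, and regular files are ignored because they
    do not count against the parent's link limit.
    """
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def near_subdir_limit(path, limit=EXT3_MAX_SUBDIRS, margin=100):
    """Return True when path holds at least (limit - margin) subdirectories."""
    return count_subdirs(path) >= limit - margin
```

Calling count_subdirs("logs/userlogsOLD") on the affected node and comparing the result with the first-level count of 31999 reported above would show whether the directory sits at the assumed limit.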
