Look at how many job_ directories there are on your slave nodes. We're using Cloudera, so they're under the 'userlogs' directory; I'm not sure where they live on 'pure' Apache.
As we approach 30k we see this. (We run a monthly report that does tens of thousands of jobs in a few days.) We've tried tuning the number of jobs the JobTracker keeps in its history, but it doesn't always help. So we have an hourly cron job that finds any files older than 4 hours in that directory and removes them. None of our individual jobs runs for more than 30 minutes, so waiting 4 hours before blowing them away hasn't caused us any problems.

On Thu, Apr 26, 2012 at 5:17 AM, JunYong Li <lij...@gmail.com> wrote:
> maybe exists file hole, are -sh and du -sch /tmp results same?
>
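For anyone wanting to do the same, the hourly cleanup we describe can be sketched roughly like this. The log directory path and the script name are assumptions (adjust them for your install; Cloudera puts task logs under a 'userlogs' directory), and the 4-hour cutoff is just what works for us given our sub-30-minute jobs:

```shell
#!/bin/sh
# Hypothetical cleanup script; run it hourly from cron, e.g.:
#   0 * * * * /usr/local/bin/clean_userlogs.sh
# LOGDIR is an assumption -- point it at wherever your job_ dirs live.
LOGDIR=${LOGDIR:-/var/log/hadoop/userlogs}

# Remove job_* entries whose mtime is older than 4 hours (240 minutes).
# -mindepth/-maxdepth keep find from recursing into the job dirs themselves.
[ -d "$LOGDIR" ] && find "$LOGDIR" -mindepth 1 -maxdepth 1 \
    -name 'job_*' -mmin +240 -exec rm -rf {} +
```

Note that this keys off modification time, so a directory still being written to won't match until it has been quiet for 4 hours.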