I like Solution 3, as long as there is an API to query the logs.

ben
On Tuesday 29 August 2006 11:33, Mahadev konar (JIRA) wrote:
> Separating user logs from system logs in map reduce
> ----------------------------------------------------
>
>                  Key: HADOOP-489
>                  URL: http://issues.apache.org/jira/browse/HADOOP-489
>              Project: Hadoop
>           Issue Type: Improvement
>           Components: mapred
>             Reporter: Mahadev konar
>          Assigned To: Mahadev konar
>             Priority: Minor
>
> Currently the user logs are part of the system logs in mapreduce. Anything
> logged by the user is logged into the tasktracker log files. This creates
> two issues:
> 1) The system log files get cluttered with user output. If the user outputs
> a large amount of logs, the system logs need to be cleaned up pretty often.
> 2) It is difficult for the user to get to each of the machines and look for
> the logs his/her job might have generated.
>
> I am proposing three solutions to the problem. Each of them has its own
> issues.
>
> Solution 1.
> Output the user logs to the user's screen as part of the job submission
> process.
>
> Merits -
> This will discourage users from printing large amounts of logs, and the user
> gets runtime feedback on what is wrong with his/her job.
>
> Issues -
> This proposal uses framework bandwidth while jobs are running. The user logs
> would have to pass from the tasks to the tasktrackers, from the tasktrackers
> to the jobtracker, and then from the jobtracker to the jobclient, consuming
> a lot of framework bandwidth if the user prints out too much data.
>
> Solution 2.
> Output the user logs into a DFS directory and then concatenate these files.
> Each task can create a file for its output in the log directory for a given
> user and job id.
>
> Issues -
> This will create a huge number of small files in DFS that later have to be
> concatenated into a single file. There is also the question of who would do
> the concatenation. It could be done by the framework (jobtracker) as part of
> job cleanup, but that might stress the jobtracker.
>
> Solution 3.
> Put the user logs into a separate user log file in the log directory on each
> tasktracker. We can provide tools to query these local log files, e.g. a
> command like "for job id j and task id t, get me the user log output". These
> tools could run as a separate map reduce program, with each map grepping the
> user log files and a single reduce aggregating the logs into a single DFS
> file.
>
> Issues -
> This does sound like more work for the user. Also, the output might not be
> complete, since a tasktracker might have gone down after it ran the job.
>
> Any thoughts?
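To make Solution 3 concrete, the query tool could be a small map reduce job along
the lines of the rough sketch below, written against the org.apache.hadoop.mapred
API. This is only an illustration, not a proposed implementation: it assumes the
per-tasktracker user log files are reachable as job input, that each log line is
tagged with the job id it belongs to, and the class names and the
userlog.grep.jobid property are made up for the example. Each map keeps the lines
for the requested job, and a single reduce concatenates them into one DFS file.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical "grep the user logs for one job" tool, sketched for discussion.
public class UserLogGrep {

  // Each map reads a chunk of a per-tasktracker user log file and keeps only
  // the lines tagged with the requested job id (the tagging scheme is assumed).
  public static class GrepMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String jobId;

    public void configure(JobConf conf) {
      jobId = conf.get("userlog.grep.jobid");   // made-up property name
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      if (line.toString().contains(jobId)) {
        out.collect(new Text(jobId), line);
      }
    }
  }

  // A single reduce concatenates all matching lines into one output file.
  public static class ConcatReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text jobId, Iterator<Text> lines,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      while (lines.hasNext()) {
        out.collect(jobId, lines.next());
      }
    }
  }

  // Usage: UserLogGrep <job id> <user log dir> <output dir>
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UserLogGrep.class);
    conf.setJobName("userlog-grep");
    conf.set("userlog.grep.jobid", args[0]);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(GrepMapper.class);
    conf.setReducerClass(ConcatReducer.class);
    conf.setNumReduceTasks(1);   // single reduce -> single DFS output file
    FileInputFormat.setInputPaths(conf, new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

The hard part is glossed over here: getting each map to read the logs that live
on its own tasktracker's local disk rather than splits of DFS files. That
routing, and whether it is exposed as a command-line tool or as a proper API as
Ben asks for, is what still needs a design.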