Thanks. map.input.file is exactly what I need. One more question: is there a way to ignore a file in an input path? For example, suppose the data in Hadoop is stored in a directory structure /<date>/<machine>.txt. If on Dec 1, 2008 I have a file from each of machines a and b, I would have the following directory structure:
/20081201/a.txt
/20081201/b.txt

What I'd like to do is have a job that, depending on the configuration, would either process all files or files for a given machine only (say a, but not b). Is that possible to do or am I trying to do something that's using Hadoop in a way that it's not intended to be used? I looked briefly at MultipleInputs which seems to be able to handle different input paths, but not handle a single input path in different ways depending on filename.

Thanks again.

Andy

-----Original Message-----
From: Devaraj Das [mailto:[EMAIL PROTECTED]
Sent: Sunday, December 07, 2008 12:11 PM
To: [email protected]
Subject: Re: Can mapper get access to filename being processed?

On 12/7/08 11:32 PM, "Andy Sautins" <[EMAIL PROTECTED]> wrote:

> I'm having trouble finding a way to do what I want, so I'm wondering
> if I'm just not looking at the right place or if I'm thinking about the
> problem in the wrong way. Any insight would be appreciated.
>
> Let's say I have a directory of files that contains a combination of
> different file types. The MapReduce job needs to process all files in
> the directory but generates different key/value pairs depending on the
> file being processed. What I'd like to do is use the filename to
> identify the file type being processed and use that information in the
> map job. What it seems like what I'd want is the map job to have access
> to the filename of the input file split being processed. I haven't been
> able to find out if that is available to a derived class of
> MapReduceBase.

That's map.input.file available in the map via JobConf. The mapper class has to override the implementation of configure in MapReduceBase and get the filename via JobConf.get("map.input.file"). Store that in some field variable of your mapper class. You can then inspect that in your map method.

> Does what I'm trying to do make sense or is there a better way of
> processing a job like the one I'm describing?

Look at MultipleInputs class (in the mapred.lib directory). That could prove useful.

> Thank you
>
> Andy
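
For the per-machine selection asked about at the top of this message, the decision itself can be sketched in plain Java. This is only an illustration under stated assumptions: MachineFileSelector and its methods are hypothetical names, not part of Hadoop, and the sketch assumes the /<date>/<machine>.txt layout described above. In an actual job, the same check could probably live in a PathFilter registered with FileInputFormat.setInputPathFilter, or be replaced by a glob input path such as /*/a.txt. One thing to verify before relying on a filter is that it may also be applied to the date directories themselves, so it would need to accept those too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Hadoop class): decides which input files to
// process, assuming paths of the form /<date>/<machine>.txt.
class MachineFileSelector {

    // Machine to keep; null means "process every file".
    private final String machine;

    MachineFileSelector(String machine) {
        this.machine = machine;
    }

    // Returns the last path component, e.g. "a.txt" for "/20081201/a.txt".
    static String fileName(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    // True if the file should be processed under the current configuration.
    boolean accept(String path) {
        if (machine == null) {
            return true;
        }
        return fileName(path).equals(machine + ".txt");
    }

    // Keeps only the accepted paths, preserving their order.
    List<String> select(List<String> paths) {
        List<String> kept = new ArrayList<String>();
        for (String p : paths) {
            if (accept(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
                "/20081201/a.txt", "/20081201/b.txt", "/20081202/a.txt");
        // First argument, if given, names the machine to process.
        String wanted = args.length > 0 ? args[0] : null;
        for (String p : new MachineFileSelector(wanted).select(inputs)) {
            System.out.println(p);
        }
    }
}
```

Passing null (no machine configured) keeps every file, which corresponds to the "process all files" case; passing "a" keeps only the a.txt files across all dates.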
