RE: Can mapper get access to filename being processed?

Andy Sautins Sun, 07 Dec 2008 18:54:46 -0800

  Thanks.  map.input.file is exactly what I need. 

  One more question.  Is there a way to ignore a file in an input path?
For example, suppose the data in Hadoop is stored in the directory
structure /<date>/<machine>.txt.  If on Dec 1, 2008 I have a file from
each of machines a and b, I would have the following directory
structure:


   /20081201/a.txt
   /20081201/b.txt

   What I'd like to do is have a job that, depending on the
configuration, would process either all files or only the files for a
given machine (say a, but not b).

   Is that possible, or am I trying to use Hadoop in a way it's not
intended to be used?  I looked briefly at MultipleInputs, which seems
able to handle different input paths, but not to handle a single input
path in different ways depending on filename.
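
   To make the selection rule concrete, here is a minimal sketch in plain
Java with no Hadoop dependency.  The class name MachineFileSelector, its
methods, and the idea of passing the machine name as a single string are
all hypothetical; how such a predicate would be wired into the job's
input selection is not shown and is left as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: models the "all files, or files for one machine" rule
// for paths laid out as /<date>/<machine>.txt.
final class MachineFileSelector {

    // Hypothetical predicate. A null or empty machine means "accept all files".
    static boolean accept(String path, String machine) {
        if (machine == null || machine.isEmpty()) {
            return true;
        }
        int slash = path.lastIndexOf('/');
        String name = slash < 0 ? path : path.substring(slash + 1);
        return name.equals(machine + ".txt");
    }

    // Returns the accepted paths, preserving their original order.
    static List<String> select(List<String> paths, String machine) {
        List<String> kept = new ArrayList<String>();
        for (String p : paths) {
            if (accept(p, machine)) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

   With the two files above, select(paths, "a") would keep only
/20081201/a.txt, while select(paths, null) would keep both.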

   Thanks again.

   Andy

-----Original Message-----
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Sunday, December 07, 2008 12:11 PM
To: [email protected]
Subject: Re: Can mapper get access to filename being processed?




On 12/7/08 11:32 PM, "Andy Sautins" <[EMAIL PROTECTED]> wrote:

>  
> 
>    I'm having trouble finding a way to do what I want, so I'm wondering
> if I'm just not looking in the right place or if I'm thinking about the
> problem in the wrong way.  Any insight would be appreciated.
> 
>  
> 
>    Let's say I have a directory of files that contains a combination of
> different file types.  The MapReduce job needs to process all files in
> the directory but generates different key/value pairs depending on the
> file being processed.  What I'd like to do is use the filename to
> identify the file type being processed and use that information in the
> map job.  It seems that what I'd want is for the map job to have access
> to the filename of the input file split being processed.  I haven't been
> able to find out if that is available to a derived class of
> MapReduceBase.
> 
> 
That's map.input.file, available in the map via JobConf. The mapper class
has to override the implementation of configure in MapReduceBase and get
the filename via JobConf.get("map.input.file"). Store that in a field of
your mapper class. You can then inspect it in your map method.
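
Once the mapper has stored that string, telling file types apart is plain
string handling. The following is a minimal sketch with no Hadoop
dependency; the class InputFileType, its helper methods, and the example
path are hypothetical, and it assumes the stored value is the full path of
the input file.

```java
// Sketch only: plain Java, no Hadoop dependency.
// In a mapper, the argument would be the value read with
// JobConf.get("map.input.file") in configure and kept in a field.
final class InputFileType {

    // Hypothetical helper: returns the last path component, e.g. "a.txt".
    static String baseName(String mapInputFile) {
        int slash = mapInputFile.lastIndexOf('/');
        return slash < 0 ? mapInputFile : mapInputFile.substring(slash + 1);
    }

    // Hypothetical helper: returns the extension without the dot,
    // or "" when the name has no extension.
    static String extension(String mapInputFile) {
        String name = baseName(mapInputFile);
        int dot = name.lastIndexOf('.');
        return dot <= 0 ? "" : name.substring(dot + 1);
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode/20081201/a.txt"; // made-up example value
        System.out.println(baseName(path) + " " + extension(path));
    }
}
```

The map method could then branch on the base name or extension to decide
which key/value pairs to emit.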

> 
>    Does what I'm trying to do make sense or is there a better way of
> processing a job like the one I'm describing?
> 
>
Look at the MultipleInputs class (in the mapred.lib directory). That
could prove useful.
> 
>    Thank you
> 
>  
> 
>    Andy
