You should be able to use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html to achieve this. It supports creating subdirectories under the main job output directory. However, the special characters (e.g. -, =, etc.) may be an issue, for which you'll either need https://issues.apache.org/jira/browse/MAPREDUCE-2293 or a custom hack to bypass that built-in restriction.
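To make that concrete, here is a rough, untested sketch against the newer org.apache.hadoop.mapreduce API; the class names (HourlyPartitionJob and friends) and paths are placeholders, not anything from your code. The mapper tags each record with the dt=.../ts=... portion of its input file's path, and the reducer passes that string as the baseOutputPath of the three-argument MultipleOutputs.write(), which may contain '/' and therefore recreates the same layout under the job's output directory. A glob on the input path lets a single job cover all the hourly files in parallel, which also speaks to the question quoted below:

// Rough sketch only -- untested, and the class/path names
// (HourlyPartitionJob, logs_hourly_out, etc.) are made up for illustration.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HourlyPartitionJob {

  // Tags each record with the "dt=.../ts=..." portion of the path of the
  // file it was read from, so the reducer can recreate that layout.
  public static class HourlyPartitionMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Path file = ((FileSplit) context.getInputSplit()).getPath();
      String ts = file.getParent().getName();             // e.g. ts=1360887451
      String dt = file.getParent().getParent().getName(); // e.g. dt=2013-02-15
      outKey.set(dt + "/" + ts);
      context.write(outKey, value);
    }
  }

  public static class HourlyPartitionReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // The baseOutputPath may contain '/', '=', '-'; only *named outputs*
      // (the other write() overloads) are restricted to alphanumerics, which
      // is what MAPREDUCE-2293 is about. Output lands at e.g.
      // <outdir>/dt=2013-02-15/ts=1360887451/part-r-00000
      String baseOutputPath = key.toString() + "/part";
      for (Text value : values) {
        // Real log processing would go here; this just passes lines through.
        mos.write(NullWritable.get(), value, baseOutputPath);
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      mos.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(HourlyPartitionJob.class);
    job.setMapperClass(HourlyPartitionMapper.class);
    job.setReducerClass(HourlyPartitionReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // A glob pulls in every hourly file under the day in a single job,
    // so the hours are processed in parallel rather than in a loop.
    FileInputFormat.addInputPath(job,
        new Path("logs_hourly/dt=2013-02-15/ts=*"));
    FileOutputFormat.setOutputPath(job, new Path("logs_hourly_out"));

    // Avoids empty default part-r-* files at the output root.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}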
Alternatively, also look at http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html (not sure about subdirectory support here, but worth checking out). Note that this class isn't present in the newer MR API, having been replaced by the aforementioned MultipleOutputs.

On Sat, Feb 16, 2013 at 12:11 AM, Max Lebedev <ma...@actionx.com> wrote:
> Hi, I am a CS undergraduate working with Hadoop. I wrote a library to process
> logs; my input directory has the following structure:
>
> logs_hourly
> ├── dt=2013-02-15
> │   ├── ts=1360887451
> │   │   └── syslog-2013-02-15-1360887451.gz
> │   └── ts=1360891051
> │       └── syslog-2013-02-15-1360891051.gz
> ├── dt=2013-02-14
> │   ├── ts=1360801050
> │   │   └── syslog-2013-02-14-1360801050.gz
> │   └── ts=1360804651
> │       └── syslog-2013-02-14-1360804651.gz
>
> where dt is the day and ts is the hour when the log was created.
>
> Currently, the code takes an input directory (or a range of input
> directories) such as dt=2013-02-15 and goes through every file in every
> subdirectory sequentially with a loop. This process is slow, and I think
> that running the code on the files in parallel would be more efficient.
> Is there any way that I could use Hadoop's MapReduce on a directory such
> as dt=2013-02-15 and receive the same directory structure as output?
>
> Thanks,
> Max Lebedev
>
>
>
> --
> View this message in context:
> http://hadoop.6.n7.nabble.com/Running-hadoop-on-directory-structure-tp67904.html
> Sent from the common-user mailing list archive at Nabble.com.

--
Harsh J