You should be able to use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html to achieve this. It supports creating subdirectories under the main job output directory. However, the special characters (e.g. -, =, etc.) may be an issue, for which you'll either need https://issues.apache.org/jira/browse/MAPREDUCE-2293 or a custom hack to bypass that built-in restriction.
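To make that concrete, here is a rough, untested sketch against the newer org.apache.hadoop.mapreduce API; the class names (HourlyPartitionJob and friends) and paths are placeholders, not anything from your code. The mapper tags each record with the dt=.../ts=... portion of its input file's path, and the reducer passes that string as the baseOutputPath of the three-argument MultipleOutputs.write(), which may contain '/' and therefore recreates the same layout under the job's output directory. A glob on the input path lets a single job cover all the hourly files in parallel, which also speaks to the question quoted below:

// Rough sketch only -- untested, and the class/path names
// (HourlyPartitionJob, logs_hourly_out, etc.) are made up for illustration.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HourlyPartitionJob {

  // Tags each record with the "dt=.../ts=..." portion of the path of the
  // file it was read from, so the reducer can recreate that layout.
  public static class HourlyPartitionMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Path file = ((FileSplit) context.getInputSplit()).getPath();
      String ts = file.getParent().getName();             // e.g. ts=1360887451
      String dt = file.getParent().getParent().getName(); // e.g. dt=2013-02-15
      outKey.set(dt + "/" + ts);
      context.write(outKey, value);
    }
  }

  public static class HourlyPartitionReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // The baseOutputPath may contain '/', '=', '-'; only *named outputs*
      // (the other write() overloads) are restricted to alphanumerics, which
      // is what MAPREDUCE-2293 is about. Output lands at e.g.
      // <outdir>/dt=2013-02-15/ts=1360887451/part-r-00000
      String baseOutputPath = key.toString() + "/part";
      for (Text value : values) {
        // Real log processing would go here; this just passes lines through.
        mos.write(NullWritable.get(), value, baseOutputPath);
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      mos.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(HourlyPartitionJob.class);
    job.setMapperClass(HourlyPartitionMapper.class);
    job.setReducerClass(HourlyPartitionReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // A glob pulls in every hourly file under the day in a single job,
    // so the hours are processed in parallel rather than in a loop.
    FileInputFormat.addInputPath(job,
        new Path("logs_hourly/dt=2013-02-15/ts=*"));
    FileOutputFormat.setOutputPath(job, new Path("logs_hourly_out"));

    // Avoids empty default part-r-* files at the output root.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}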
Alternatively, also look at http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html (not sure about subdirectory support here, but worth checking out). Note that this class isn't present in the newer MR API, having been replaced by the aforementioned MultipleOutputs.

On Sat, Feb 16, 2013 at 12:11 AM, Max Lebedev <ma...@actionx.com> wrote:
> Hi, I am a CS undergraduate working with Hadoop. I wrote a library to process
> logs; my input directory has the following structure:
>
> logs_hourly
> ├── dt=2013-02-15
> │   ├── ts=1360887451
> │   │   └── syslog-2013-02-15-1360887451.gz
> │   └── ts=1360891051
> │       └── syslog-2013-02-15-1360891051.gz
> ├── dt=2013-02-14
> │   ├── ts=1360801050
> │   │   └── syslog-2013-02-14-1360801050.gz
> │   └── ts=1360804651
> │       └── syslog-2013-02-14-1360804651.gz
>
> where dt is the day and ts is the hour when the log was created.
>
> Currently, the code takes an input directory (or a range of input
> directories) such as dt=2013-02-15 and goes through every file in every
> subdirectory sequentially with a loop. This process is slow, and I think
> that running the code on the files in parallel would be more efficient.
> Is there any way that I could use Hadoop's MapReduce on a directory such
> as dt=2013-02-15 and receive the same directory structure as output?
>
> Thanks,
> Max Lebedev
>
>
>
> --
> View this message in context:
> http://hadoop.6.n7.nabble.com/Running-hadoop-on-directory-structure-tp67904.html
> Sent from the common-user mailing list archive at Nabble.com.

--
Harsh J