Hi, I have data files each consisting of scads of very long records, each record being an XML doc in its own right. These XML docs have a complex structure, something like this:
<record>
  <sec1>
    <foo>
      <bar id="asd"><stuff><MORE STUFF></bar>
      <bar ... >
    </foo>
  </sec1>
  <sec2> </sec2>
  ...
  <secN> </secN>
</record>

(except that the real records have no line breaks). These are generated by another system, aggregated by Flume, and dumped into HDFS.

Anyhoo ... I'd like to load all of this into Hive tables. Logically, the <sec> sections fit reasonably well into individual tables, and that matches the sorts of reports and data mining we want to do over the data.

To start with, writing Java code is not really on. While I speak several programming languages, I am not fluent in Java or proficient in Java development, so I plan to do any map/reduce steps using Streaming and Python. I've looked into this and done some proof-of-concept work with Streaming/Python, but I am fairly new to HDFS/Hadoop.

One approach that would definitely work is to run a Streaming mapper-only job per table to be produced (roughly like the first sketch in the P.S. below). Each Streaming job produces a directory of part-xxx files, and we import those files into Hive.

The disadvantage, it seems to me, is that we have to process each data file multiple times, each pass spitting out one "type" of Hive table. That seems a bit inefficient. What I think I really want is a single map/reduce job that sends its output to different directories in HDFS, perhaps by having the mapper write to a different file depending on the key it is dealing with (something like the second sketch in the P.S.). Is this possible? Or do I just have to launch a map/reduce job (well, map-only, actually) for each output directory I want, so that each directory contains input for a single Hive table?

Thanks for any hints.

Regards
Liam
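P.S. For concreteness, here is roughly what one of my proof-of-concept mapper-only jobs looks like, one job per target table. This is only a minimal sketch: the element and attribute names ("sec1", "bar", "id") are stand-ins for the real schema, and it assumes each record arrives as a single line on stdin.

#!/usr/bin/env python
# Streaming mapper for a single table: pull the <sec1> section out of
# each record and emit one tab-separated row per <bar> element, ready
# to be loaded into the corresponding Hive table.
import sys
import xml.etree.ElementTree as ET

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = ET.fromstring(line)
    except ET.ParseError:
        continue                      # skip malformed records
    sec = record.find('sec1')
    if sec is None:
        continue
    for bar in sec.iter('bar'):
        # one row: the bar's id attribute plus the text of each child
        row = [bar.get('id', '')]
        row.extend((child.text or '').strip() for child in bar)
        print('\t'.join(row))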
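And the single-pass version I have in mind would be something along these lines (again just a sketch with placeholder section names): the mapper tags every output line with the name of the table it belongs to, so each record is parsed only once. The part I don't know how to do is getting Hadoop to split that output into per-table files/directories, rather than leaving the key prefix for a later step to deal with.

#!/usr/bin/env python
# Single-pass Streaming mapper: emit "<table_name>\t<row>" for every
# section found in a record, so one job reads each data file only once.
# Something downstream still has to separate the rows by the leading
# key before they can be loaded into different Hive tables.
import sys
import xml.etree.ElementTree as ET

SECTIONS = ['sec1', 'sec2', 'secN']   # one Hive table per section name

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = ET.fromstring(line)
    except ET.ParseError:
        continue                      # skip malformed records
    for name in SECTIONS:
        sec = record.find(name)
        if sec is None:
            continue
        # Flatten the section however its target table expects;
        # here we just dump each child's tag and text.
        row = '\t'.join('%s=%s' % (child.tag, (child.text or '').strip())
                        for child in sec)
        print('%s\t%s' % (name, row))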