Multiple input formats and multiple output formats in Hadoop 0.20.2

Jian Fang Wed, 10 Aug 2011 09:09:31 -0700

Hi,

I am working on a project that requires multiple input formats and
multiple output formats. I store sales rank data in a Cassandra cluster,
and each day I receive a sales rank update file that I use to update the
ranks in Cassandra. At the same time, I need to find all the products
whose rank change exceeds a threshold and write them to a file. In other
words, I need two input formats, one reading from the file system (the
sales rank update file) and one reading from Cassandra (the current sales
ranks), and two output formats, one writing to the file system (the
products whose rank change exceeds the threshold) and one writing to
Cassandra (the new ranks).
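
To make the data flow concrete, here is a minimal sketch in plain Java of
what the merged job would need to compute. It deliberately uses no Hadoop
or Cassandra APIs: the two maps stand in for the two inputs, and the two
fields of Result stand in for the two outputs. The names RankChangeSketch,
Result, merge, newRanks and bigMovers are made up for illustration, and
skipping products that have no current rank in the threshold check is an
assumption, not a requirement I have settled on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RankChangeSketch {

    /** Holds one collection per output destination (hypothetical names). */
    public static final class Result {
        // Stand-in for the Cassandra output: product ID -> new rank.
        public final Map<String, Integer> newRanks = new LinkedHashMap<>();
        // Stand-in for the file output: "productId<TAB>oldRank<TAB>newRank".
        public final List<String> bigMovers = new ArrayList<>();
    }

    /**
     * Joins the current ranks (stand-in for the Cassandra input) with the
     * daily updates (stand-in for the update file) on product ID.
     */
    public static Result merge(Map<String, Integer> currentRanks,
                               Map<String, Integer> updates,
                               int threshold) {
        Result result = new Result();
        for (Map.Entry<String, Integer> update : updates.entrySet()) {
            String productId = update.getKey();
            int newRank = update.getValue();

            // Every updated product gets its new rank written back.
            result.newRanks.put(productId, newRank);

            // Assumption: a product with no current rank is not reported.
            Integer oldRank = currentRanks.get(productId);
            if (oldRank != null && Math.abs(newRank - oldRank) > threshold) {
                result.bigMovers.add(productId + "\t" + oldRank + "\t" + newRank);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> current = new LinkedHashMap<>();
        current.put("A", 10);
        current.put("B", 50);

        Map<String, Integer> updates = new LinkedHashMap<>();
        updates.put("A", 12);
        updates.put("B", 5);
        updates.put("C", 7);

        Result result = merge(current, updates, 20);
        System.out.println("new ranks:  " + result.newRanks);
        System.out.println("big movers: " + result.bigMovers);
    }
}
```

In a single MapReduce job, the join on product ID would presumably happen
in the reducer, with each input tagged by its source, and the two
collections above would become the two outputs.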


Right now, I use several cascading jobs to do the work, with HDFS to
share data between them. This is not very efficient, since some
intermediate files have to be read multiple times by different jobs. I
wonder if there is a more elegant way to solve this problem. It seems that
Hadoop 0.19 supports multiple input/output formats. It would be great if I
could merge the jobs into one with multiple input formats and multiple
output formats. Is this doable in Hadoop 0.20.2? Are there any examples of
multiple input formats and multiple output formats for Hadoop 0.20.2?

Thanks in advance,

John
