Hi, I have looked at the different file types and input/output formats, but got quite confused, and I am not sure how to connect the pipe from one format to the next.
Here is what I would like to do:

1. Pass a string to my Hadoop program, which will write this single key-value pair to a file on the fly.

2. The first job will read from this file, do some processing, and write more key-value pairs to other files (in the same format as the file in step 1). Subsequent jobs will read from the files generated by the first job, and this will continue iteratively until some terminal condition has been reached.

3. Both the key and the value in the file should be text (i.e. human-readable ASCII).

While this sounds simple, I have been having trouble figuring out the correct formats to use, and here is why:

- JobConf.setInputKeyClass and setInputValueClass are both deprecated, so I am avoiding them.

- SequenceFileOutputFormat doesn't work, because the key has to be IntWritable and a Text key causes the code to blow up. (I still don't quite understand why, since a SequenceFile.Writer can take Text for both keys and values.)

- KeyValueTextInputFormat looks promising, but I am not sure how to bootstrap the first file mentioned in step 1, i.e. what format and writer I should use to create the file that holds the initial argument.

I have a feeling that this is actually a very simple problem, and that I am just not looking in the right direction. Your help would be greatly appreciated.

-- Jim
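
P.S. For concreteness, here is the kind of bootstrap code I have in mind for step 1, on the assumption that KeyValueTextInputFormat simply splits each line at the first tab, so the seed file can be an ordinary text file with one tab-separated line. The path and class name below are placeholders of mine:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Bootstrap {
    // Write the single initial key-value pair as one tab-separated line,
    // since KeyValueTextInputFormat splits each line at the first tab by default.
    public static void writeSeed(Configuration conf, String key, String value)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path seed = new Path("/user/jim/iter0/seed.txt"); // placeholder path
        FSDataOutputStream out = fs.create(seed, true);   // overwrite if it exists
        out.writeBytes(key + "\t" + value + "\n");
        out.close();
    }
}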
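
And here is roughly how I imagine chaining the jobs in step 2, using the old mapred API since that is what I have been working with. IdentityMapper/IdentityReducer stand in for my real processing, and a fixed iteration count stands in for my real terminal condition. Am I right that TextOutputFormat writes key <TAB> value lines, so each job's output directory can be fed straight back in as the next job's KeyValueTextInputFormat input?

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path("/user/jim/iter0"); // directory holding the seed file
        for (int i = 1; i <= 5; i++) {            // stand-in for my real terminal condition
            JobConf job = new JobConf(IterativeDriver.class);
            job.setJobName("iteration-" + i);

            // Input: each line is key <TAB> value, both presented as Text.
            job.setInputFormat(KeyValueTextInputFormat.class);
            // Output: key <TAB> value text lines, readable by the next job.
            job.setOutputFormat(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            job.setMapperClass(IdentityMapper.class);   // stand-in for real processing
            job.setReducerClass(IdentityReducer.class); // stand-in for real processing

            Path output = new Path("/user/jim/iter" + i);
            FileInputFormat.setInputPaths(job, input);
            FileOutputFormat.setOutputPath(job, output);

            JobClient.runJob(job);
            input = output; // feed this iteration's output into the next one
        }
    }
}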