Finally figured it out. I needed to use SequenceFileAsTextInputFormat. It's just the lack of examples that makes it difficult when you start.
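For the archives, here is a minimal driver sketch using
org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat, which
hands the mapper both key and value as Text (on older releases the same
class lives under org.apache.hadoop.mapred); the class names and paths are
illustrative, not the exact job from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SeqFileDriver {

      // With SequenceFileAsTextInputFormat both key and value arrive as Text,
      // so there is no LongWritable anywhere and no ClassCastException.
      public static class SeqMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
          context.write(key, value);  // identity map, just to show the types
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read sequence file");  // Job.getInstance(conf) on newer releases
        job.setJarByClass(SeqFileDriver.class);
        job.setMapperClass(SeqMapper.class);
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With this input format the map signature becomes map(Text key, Text value,
Context context), which answers the key/value class question further down
the thread.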
On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

It looks like the values are coming into the mapper as binary instead of
Text. Is this expected from a sequence file? I initially wrote the
SequenceFile with Text values.

On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Need some more help. I wrote a sequence file using the code below, but now
when I run the MapReduce job I get "java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.Text", even though I didn't use LongWritable when I
originally wrote to the sequence file.

    // Code to write to the sequence file. There is no LongWritable here.
    org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
    BufferedReader buffer = new BufferedReader(new FileReader(filePath));
    String line = null;
    org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();

    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
          value.getClass(), SequenceFile.CompressionType.RECORD);
      int i = 1;
      long timestamp = System.currentTimeMillis();
      while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
      }

On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:

Hi,

I think the following link will help:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Cheers
Arko

On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Sorry, maybe it's something obvious, but I was wondering: when map or reduce
gets called, what would be the class used for the key and the value? If I
used "org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();",
would map be called with the Text class?

    public void map(LongWritable key, Text value, Context context) throws
        IOException, InterruptedException {

On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:

Hi Mohit,

I am not sure that I understand your question.

But you can write into a file using:

    BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(my_path, true)));
    output.write(data);

Then you can pass that file as the input to your MapReduce program:

    FileInputFormat.addInputPath(jobconf, new Path(my_path));

From inside your Map/Reduce methods, I think you should NOT be tinkering
with the input/output paths of that Map/Reduce job.

Cheers
Arko

On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Thanks. How does MapReduce work on a sequence file? Is there an example I
can look at?
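A quick way to confirm what a sequence file actually holds is to open it
with SequenceFile.Reader and print the key and value class names; a small
sketch, assuming a Text/Text file like the one the writer code above
produces (the class name and path argument are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFilePeek {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
          // Both should print org.apache.hadoop.io.Text for a file
          // written by the code above.
          System.out.println("key class:   " + reader.getKeyClassName());
          System.out.println("value class: " + reader.getValueClassName());

          Text key = new Text();
          Text value = new Text();
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }

If the classes printed here are Text but the job still sees LongWritable,
the job is most likely running with the default TextInputFormat (whose keys
are LongWritable byte offsets) instead of a sequence-file input format.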
On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:

Hi,

Let's say all the smaller files are in the same directory. Then you can do:

    BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(output_path, true)));  // output path

    FileStatus[] output_files = fs.listStatus(new Path(input_path));  // input directory

    for (int i = 0; i < output_files.length; i++) {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(output_files[i].getPath())));
      String data = reader.readLine();
      while (data != null) {
        output.write(data);
        data = reader.readLine();  // read the next line, or the loop never ends
      }
      reader.close();
    }
    output.close();

In case you have the files in multiple directories, call the code for each
of them with a different input path.

Hope this helps!

Cheers
Arko

On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

I am trying to find examples that demonstrate using sequence files,
including writing to one and then running MapReduce on it, but I am unable
to find any. Could you please point me to some examples of sequence files?

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:

Hi Mohit

AFAIK XMLLoader in pig won't be suited for sequence files. Please post the
same to the Pig user group for a workaround.

A SequenceFile is the preferred option when we want to store small files in
HDFS and process them with MapReduce, since it stores data in key/value
format. And since SequenceFileInputFormat is available at your disposal,
you don't need any custom input format to process it with MapReduce. It is
a cleaner and better approach than just appending small XML file contents
into one big file.
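A sketch of that approach, packing each small XML file in a directory into
one SequenceFile with the file name as key and the whole file body as the
value; the class name and path arguments are made up for illustration:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallXmlFilesToSeqFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);  // directory of small xml files
        Path seqFile = new Path(args[1]);   // the one big sequence file

        Text key = new Text();
        Text value = new Text();
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            seqFile, Text.class, Text.class, SequenceFile.CompressionType.RECORD);
        try {
          for (FileStatus status : fs.listStatus(inputDir)) {
            // One record per small file: file name -> entire file content.
            StringBuilder body = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(status.getPath())));
            try {
              String line;
              while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
              }
            } finally {
              reader.close();
            }
            key.set(status.getPath().getName());
            value.set(body.toString());
            writer.append(key, value);
          }
        } finally {
          writer.close();
        }
      }
    }

Each map call then sees one whole XML document as its value, which sidesteps
the small-files problem without appending everything to a single text file.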
On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Thanks. I was planning to use pig's
org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work
with a sequence file?

This text file that I was referring to would be in HDFS itself. Is it still
different than using a sequence file?

On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:

Mohit

Rather than just appending the content into a normal text file or so, you
can create a sequence file with the individual smaller file contents as
values.

Regards
Bejoy.K.S

On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

We have small XML files. Currently I am planning to append these small
files to one file in HDFS so that I can take advantage of splits, larger
blocks, and sequential IO. What I am unsure of is whether it's OK to append
one file at a time to this HDFS file.

Could someone suggest if this is OK? Would like to know how others do it.