Finally figured it out. I needed to use SequenceFileAsTextInputFormat. It's just the lack of examples that makes it difficult when you start.
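For the archives, here is a minimal driver sketch using
org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat, which
hands the mapper both key and value as Text (on older releases the same
class lives under org.apache.hadoop.mapred); the class names and paths are
illustrative, not the exact job from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SeqFileDriver {

      // With SequenceFileAsTextInputFormat both key and value arrive as Text,
      // so there is no LongWritable anywhere and no ClassCastException.
      public static class SeqMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
          context.write(key, value);  // identity map, just to show the types
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read sequence file");  // Job.getInstance(conf) on newer releases
        job.setJarByClass(SeqFileDriver.class);
        job.setMapperClass(SeqMapper.class);
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With this input format the map signature becomes map(Text key, Text value,
Context context), which answers the key/value class question further down
the thread.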
On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

It looks like the values are coming into the mapper as binary instead of
Text. Is this expected from a sequence file? I initially wrote the
SequenceFile with Text values.

On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Need some more help. I wrote a sequence file using the code below, but now
when I run the MapReduce job I get "java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.Text", even though I didn't use LongWritable when I
originally wrote to the sequence file.

    // Code to write to the sequence file. There is no LongWritable here.
    org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
    BufferedReader buffer = new BufferedReader(new FileReader(filePath));
    String line = null;
    org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();

    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
          value.getClass(), SequenceFile.CompressionType.RECORD);
      int i = 1;
      long timestamp = System.currentTimeMillis();
      while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
      }

On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:

Hi,

I think the following link will help:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Cheers
Arko

On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Sorry, maybe it's something obvious, but I was wondering: when map or reduce
gets called, what would be the class used for the key and the value? If I
used "org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();",
would map be called with the Text class?

    public void map(LongWritable key, Text value, Context context) throws
        IOException, InterruptedException {

On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:

Hi Mohit,

I am not sure that I understand your question.

But you can write into a file using:

    BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(my_path, true)));
    output.write(data);

Then you can pass that file as the input to your MapReduce program:

    FileInputFormat.addInputPath(jobconf, new Path(my_path));

From inside your Map/Reduce methods, I think you should NOT be tinkering
with the input/output paths of that Map/Reduce job.

Cheers
Arko

On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Thanks. How does MapReduce work on a sequence file? Is there an example I
can look at?
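A quick way to confirm what a sequence file actually holds is to open it
with SequenceFile.Reader and print the key and value class names; a small
sketch, assuming a Text/Text file like the one the writer code above
produces (the class name and path argument are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFilePeek {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
          // Both should print org.apache.hadoop.io.Text for a file
          // written by the code above.
          System.out.println("key class:   " + reader.getKeyClassName());
          System.out.println("value class: " + reader.getValueClassName());

          Text key = new Text();
          Text value = new Text();
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }

If the classes printed here are Text but the job still sees LongWritable,
the job is most likely running with the default TextInputFormat (whose keys
are LongWritable byte offsets) instead of a sequence-file input format.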
On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:

Hi,

Let's say all the smaller files are in the same directory. Then you can do:

    BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(output_path, true)));  // output path

    FileStatus[] output_files = fs.listStatus(new Path(input_path));  // input directory

    for (int i = 0; i < output_files.length; i++) {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(output_files[i].getPath())));
      String data = reader.readLine();
      while (data != null) {
        output.write(data);
        data = reader.readLine();  // read the next line, or the loop never ends
      }
      reader.close();
    }
    output.close();

In case you have the files in multiple directories, call the code for each
of them with a different input path.

Hope this helps!

Cheers
Arko

On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

I am trying to find examples that demonstrate using sequence files,
including writing to one and then running MapReduce on it, but I am unable
to find any. Could you please point me to some examples of sequence files?

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:

Hi Mohit

AFAIK XMLLoader in pig won't be suited for sequence files. Please post the
same to the Pig user group for a workaround.

A SequenceFile is the preferred option when we want to store small files in
HDFS and process them with MapReduce, since it stores data in key/value
format. And since SequenceFileInputFormat is available at your disposal,
you don't need any custom input format to process it with MapReduce. It is
a cleaner and better approach than just appending small XML file contents
into one big file.
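A sketch of that approach, packing each small XML file in a directory into
one SequenceFile with the file name as key and the whole file body as the
value; the class name and path arguments are made up for illustration:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallXmlFilesToSeqFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);  // directory of small xml files
        Path seqFile = new Path(args[1]);   // the one big sequence file

        Text key = new Text();
        Text value = new Text();
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            seqFile, Text.class, Text.class, SequenceFile.CompressionType.RECORD);
        try {
          for (FileStatus status : fs.listStatus(inputDir)) {
            // One record per small file: file name -> entire file content.
            StringBuilder body = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(status.getPath())));
            try {
              String line;
              while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
              }
            } finally {
              reader.close();
            }
            key.set(status.getPath().getName());
            value.set(body.toString());
            writer.append(key, value);
          }
        } finally {
          writer.close();
        }
      }
    }

Each map call then sees one whole XML document as its value, which sidesteps
the small-files problem without appending everything to a single text file.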
On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Thanks. I was planning to use pig's
org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work
with a sequence file?

This text file that I was referring to would be in HDFS itself. Is it still
different than using a sequence file?

On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:

Mohit

Rather than just appending the content into a normal text file or so, you
can create a sequence file with the individual smaller file contents as
values.

Regards
Bejoy.K.S

On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

We have small XML files. Currently I am planning to append these small
files to one file in HDFS so that I can take advantage of splits, larger
blocks, and sequential IO. What I am unsure of is whether it's OK to append
one file at a time to this HDFS file.

Could someone suggest if this is OK? Would like to know how others do it.