Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Finally figured it out. I needed to use SequenceFileAsTextInputFormat.
There is just a lack of examples, which makes it difficult when you start.
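
For anyone who hits this later, a minimal driver sketch of that fix. The
class name, job name, and use of the default identity mapper are
illustrative assumptions, and depending on the Hadoop version
SequenceFileAsTextInputFormat lives in org.apache.hadoop.mapred (old API)
or org.apache.hadoop.mapreduce.lib.input (new API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadPackedXml {   // hypothetical driver class
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "read-packed-xml");
        job.setJarByClass(ReadPackedXml.class);
        // Keys and values reach map() as Text, converted from whatever
        // Writable types the sequence file actually stores.
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // No mapper set: the default identity mapper passes the Text pairs through.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}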



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
It looks like the values are coming into the mapper as binary instead of
Text. Is this expected from a sequence file? I initially wrote the
SequenceFile with Text values.


Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Need some more help. I wrote a sequence file using the code below, but now
when I run a mapreduce job on it I get "java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.Text", even though I didn't use LongWritable when I
originally wrote the sequence file.

// Code to write to the sequence file. There is no LongWritable here.

org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;

try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
            value.getClass(), SequenceFile.CompressionType.RECORD);

    long timestamp = System.currentTimeMillis();

    while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));   // note: same key for every record
        value.set(line);
        writer.append(key, value);
    }
} finally {
    IOUtils.closeStream(writer);   // org.apache.hadoop.io.IOUtils, null-safe
    buffer.close();
}
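
As the rest of the thread works out, the exception comes from the job
configuration rather than from this writer: without an explicit input
format the job uses TextInputFormat, whose keys are LongWritable byte
offsets. A minimal sketch of the driver-side fix, with the job name and
input path as illustrative assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

Job job = new Job(conf, "process-sequence-file");
// Hand map() the Text/Text records exactly as they were appended above,
// instead of TextInputFormat's LongWritable/Text offset-and-line pairs.
job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data/xml-packed.seq"));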


Re: Writing small files to one big file in hdfs

2012-02-21 Thread Arko Provo Mukherjee
Hi,

I think the following link will help:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Cheers
Arko


Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Sorry, maybe it's something obvious, but I was wondering: when map or reduce
gets called, what would be the class used for the key and value? If I used
"org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();", would
map be called with the Text class?

public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
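
For what it's worth, the parameter classes of map() are dictated by the
job's InputFormat rather than by how the data was originally written. With
SequenceFileInputFormat over the Text/Text file from earlier in the thread,
the mapper would look like this sketch (the class name is an assumption):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PackedXmlMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // key and value arrive as the classes the InputFormat produces --
        // here, the Text/Text pairs stored in the sequence file.
        context.write(key, value);   // identity pass-through for illustration
    }
}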



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Arko Provo Mukherjee
Hi Mohit,

I am not sure that I understand your question.

But you can write into a file using:
BufferedWriter output = new BufferedWriter(
    new OutputStreamWriter(fs.create(my_path, true)));
output.write(data);
Then you can pass that file as the input to your MapReduce program.

FileInputFormat.addInputPath(jobconf, new Path(my_path));

From inside your Map/Reduce methods, I think you should NOT be tinkering
with the input / output paths of that Map/Reduce job.
Cheers
Arko




Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Thanks. How does mapreduce work on a sequence file? Is there an example I
can look at?



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Arko Provo Mukherjee
Hi,

Let's say all the smaller files are in the same directory.

Then you can do:

BufferedWriter output = new BufferedWriter(
    new OutputStreamWriter(fs.create(output_path, true)));   // Output path

FileStatus[] output_files = fs.listStatus(new Path(input_path));   // Input directory

for (int i = 0; i < output_files.length; i++)
{
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(output_files[i].getPath())));

    String data = reader.readLine();

    while (data != null)
    {
        output.write(data);
        output.newLine();          // readLine() strips newlines, so restore them
        data = reader.readLine();  // advance, otherwise this loop never ends
    }

    reader.close();
}

output.close();


In case you have the files in multiple directories, call the code for each
of them with different input paths.
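
A caller sketch of that multiple-directory case, with the directory names
and the helper wrapper as assumptions:

String[] inputDirs = { "/data/xml-a", "/data/xml-b" };   // hypothetical paths
for (String dir : inputDirs) {
    mergeDirectory(fs, dir, output);   // hypothetical helper wrapping the loop above
}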

Hope this helps!

Cheers

Arko



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
I am trying to look for examples that demonstrate using sequence files,
including writing to one and then running mapred on it, but I am unable to
find any. Could you please point me to some examples of sequence files?



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Bill Graham
You might want to check out File Crusher:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I've never used it, but it sounds like it could be helpful.



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Bejoy Ks
Hi Mohit
  AFAIK XMLLoader in pig won't be suited for Sequence Files. Please
post the same to the Pig user group for some workaround.
 A SequenceFile is the preferred option when we want to store small
files in hdfs that need to be processed by MapReduce, as it stores data in
key-value format. Since SequenceFileInputFormat is available at your
disposal, you don't need any custom input format for processing the same
with map reduce. It is a cleaner and better approach compared to just
appending small xml file contents into a big file.



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks wrote:

> Mohit
>   Rather than just appending the content into a normal text file or
> so, you can create a sequence file with the individual smaller file content
> as values.
>
Thanks. I was planning to use pig's
org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work
with a sequence file?

This text file that I was referring to would be in hdfs itself. Is it
still different from using a sequence file?



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Bejoy Ks
Mohit
   Rather than just appending the content into a normal text file or
so, you can create a sequence file with the individual smaller file content
as values.

Regards
Bejoy.K.S


Re: Writing small files to one big file in hdfs

2012-02-21 Thread Joey Echeverria
I'd recommend making a SequenceFile[1] to store each XML file as a value.

-Joey

[1]
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/SequenceFile.html
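
A minimal sketch of that approach: pack every XML file in a directory into
one sequence file, keyed by file name. The paths are illustrative, and
BytesWritable is an assumed value type that keeps the XML bytes intact:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("/data/xml-packed.seq"), Text.class, BytesWritable.class);
try {
    for (FileStatus stat : fs.listStatus(new Path("/data/xml-small"))) {
        byte[] contents = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
            in.readFully(contents);   // one whole XML file per record
        } finally {
            in.close();
        }
        writer.append(new Text(stat.getPath().getName()),
                new BytesWritable(contents));
    }
} finally {
    writer.close();
}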

On Tue, Feb 21, 2012 at 12:15 PM, Mohit Anchlia wrote:

> We have small xml files. Currently I am planning to append these small
> files to one file in hdfs so that I can take advantage of splits, larger
> blocks and sequential IO. What I am unsure of is whether it's ok to append
> one file at a time to this hdfs file.
>
> Could someone suggest if this is ok? Would like to know how others do it.
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434