Re: Writing small files to one big file in hdfs
Finally figured it out. I needed to use SequenceFileAsTextInputFormat. It's just the lack of examples that makes it difficult when you start.
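For anyone landing on this thread later, the job wiring behind that fix can be sketched roughly as below. This is a sketch, assuming the new org.apache.hadoop.mapreduce API on a reasonably recent Hadoop (SequenceFileAsTextInputFormat also exists in the older org.apache.hadoop.mapred package); the driver class, mapper class, and paths are made up for illustration.

```
// Hypothetical driver: read a Text/Text sequence file with
// SequenceFileAsTextInputFormat, which converts stored keys and
// values to Text via toString() before they reach the mapper.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "read-seqfile");
job.setJarByClass(MyDriver.class);                 // hypothetical driver class
job.setInputFormatClass(
    org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat.class);
job.setMapperClass(MySequenceMapper.class);        // a Mapper<Text, Text, ...>
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path("/user/mohit/seqfile"));  // hypothetical path
FileOutputFormat.setOutputPath(job, new Path("/user/mohit/out"));    // hypothetical path
System.exit(job.waitForCompletion(true) ? 0 : 1);
```

With this input format the mapper receives Text keys and Text values regardless of the Writable types stored in the file, which is what resolves the thread's earlier "binary values" and ClassCastException confusion.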
Re: Writing small files to one big file in hdfs
It looks like in the mapper the values are coming in as binary instead of Text. Is this expected from a sequence file? I originally wrote the SequenceFile with Text values.
Re: Writing small files to one big file in hdfs
Need some more help. I wrote the sequence file using the code below, but now when I run the mapreduce job I get "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text", even though I never used LongWritable when I originally wrote the sequence file.

// Code used to write the sequence file. There is no LongWritable here.

    org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
    org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
    BufferedReader buffer = new BufferedReader(new FileReader(filePath));
    String line = null;
    try {
        writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
                value.getClass(), SequenceFile.CompressionType.RECORD);
        long timestamp = System.currentTimeMillis();
        while ((line = buffer.readLine()) != null) {
            key.set(String.valueOf(timestamp));
            value.set(line);
            writer.append(key, value);
        }
    } finally {
        if (writer != null) writer.close();
        buffer.close();
    }
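For what it's worth, the LongWritable in that exception typically comes from the job, not from the file: when no input format is set, the default TextInputFormat feeds map() LongWritable byte offsets as keys. One way to confirm what the file really stores is to open it directly. A sketch, reusing the fs/conf/path variables from the writer snippet above (the constructor shown is the old-style one used elsewhere in this thread, since deprecated):

```
// Sketch: dump the key/value classes actually recorded in the file,
// then iterate the records.
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
try {
    System.out.println(reader.getKeyClassName());   // expect org.apache.hadoop.io.Text
    System.out.println(reader.getValueClassName()); // expect org.apache.hadoop.io.Text
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {
        // key and value now hold the next record
    }
} finally {
    reader.close();
}
```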
Re: Writing small files to one big file in hdfs
Hi,

I think the following link will help:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Cheers
Arko
Re: Writing small files to one big file in hdfs
Sorry, maybe it's something obvious, but I was wondering: when map or reduce gets called, what would be the class used for the key and value? If I used "org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();", would map be called with the Text class?

    public void map(LongWritable key, Text value, Context context) throws
            IOException, InterruptedException {
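The short answer the list converges on is that the runtime key/value classes are decided by the job's InputFormat, not by the mapper signature; the signature merely has to match what the format delivers. A mapper for a Text/Text sequence file might be sketched like this (new-API; the class name is hypothetical):

```
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the generic parameters must match what the
// job's input format actually emits (here, Text keys and Text values).
public class MySequenceMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // identity pass-through for illustration
    }
}
```

If the format emits something else (e.g. TextInputFormat's LongWritable offsets), declaring Text here produces exactly the ClassCastException seen earlier in the thread.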
Re: Writing small files to one big file in hdfs
Hi Mohit,

I am not sure that I understand your question.

But you can write into a file using:

    BufferedWriter output = new BufferedWriter(
            new OutputStreamWriter(fs.create(my_path, true)));
    output.write(data);

Then you can pass that file as the input to your MapReduce program:

    FileInputFormat.addInputPath(jobconf, new Path(my_path));

From inside your Map/Reduce methods, I think you should NOT be tinkering with the input/output paths of that Map/Reduce job.

Cheers
Arko

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks wrote:
> Hi Mohit
>      AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to the Pig user group for some workaround.
>      SequenceFile is a preferred option when we want to store small files in hdfs that need to be processed by MapReduce, as it stores data in key/value format. Since SequenceFileInputFormat is available at your disposal, you don't need any custom input formats for processing it with map reduce. It is a cleaner and better approach compared to just appending small xml file contents into a big file.

On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia wrote:
>> Thanks. I was planning to use pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with a sequence file?
>> This text file that I was referring to would be in hdfs itself. Is it still different than using a sequence file?

On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks wrote:
>>> Mohit
>>>      Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file contents as values.

On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia wrote:
>>>> We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure of is whether it's ok to append one file at a time to this hdfs file.
>>>> Could someone suggest if this is ok? Would like to know how others do it.
Re: Writing small files to one big file in hdfs
Thanks. How does mapreduce work on a sequence file? Is there an example I can look at?

On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee <arkoprovomukher...@gmail.com> wrote:
> Hi,
>
> Let's say all the smaller files are in the same directory. Then you can do:
> [code and earlier quoted messages snipped; see Arko's message below in the thread]
Re: Writing small files to one big file in hdfs
Hi,

Let's say all the smaller files are in the same directory. Then you can do:

    // Output path
    BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(output_path, true)));

    // Input directory
    FileStatus[] input_files = fs.listStatus(new Path(input_path));

    for (int i = 0; i < input_files.length; i++) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(input_files[i].getPath())));
        String data = reader.readLine();
        while (data != null) {
            output.write(data);
            output.newLine();          // readLine() strips the terminator, so add it back
            data = reader.readLine();  // advance to the next line, otherwise this loops forever
        }
        reader.close();
    }
    output.close();

In case you have the files in multiple directories, call the code for each of them with different input paths.

Hope this helps!

Cheers
Arko

On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia wrote:
> I am trying to find examples that demonstrate using sequence files, including writing to them and then running mapred on them, but am unable to find one. Could you please point me to some examples of sequence files?
> [earlier quoted messages snipped]
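A caveat on the loop above: because readLine() strips line terminators, a line-by-line copy can silently alter the file contents. A byte-for-byte copy with Hadoop's IOUtils sidesteps this. The sketch below is untested against a real cluster and uses hypothetical argument names; it assumes the same input-directory/output-file layout as Arko's code:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatFiles {
    public static void main(String[] args) throws IOException {
        String inputPath = args[0];   // directory of small files
        String outputPath = args[1];  // single concatenated output file
        FileSystem fs = FileSystem.get(new Configuration());

        FSDataOutputStream out = fs.create(new Path(outputPath), true);
        for (FileStatus stat : fs.listStatus(new Path(inputPath))) {
            FSDataInputStream in = fs.open(stat.getPath());
            // 4096-byte copy buffer; 'false' keeps 'out' open for the next file
            IOUtils.copyBytes(in, out, 4096, false);
            in.close();
        }
        out.close();
    }
}
```

Run with the Hadoop jars on the classpath; with a default Configuration it operates on the local filesystem, and on a cluster it picks up HDFS from the site configuration.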
Re: Writing small files to one big file in hdfs
I am trying to find examples that demonstrate using sequence files, including writing to them and then running mapred on them, but am unable to find one. Could you please point me to some examples of sequence files?

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks wrote:
> [quoted messages snipped; see Bejoy's message below in the thread]
Re: Writing small files to one big file in hdfs
You might want to check out File Crusher:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I've never used it, but it sounds like it could be helpful.

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks wrote:
> [quoted messages snipped; see Bejoy's message below in the thread]

--
Note that I'm no longer using my Yahoo! email address. Please email me at billgra...@gmail.com going forward.
Re: Writing small files to one big file in hdfs
Hi Mohit
AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to the Pig user group for a workaround.
A SequenceFile is a preferred option when we want to store small files in hdfs that need to be processed by MapReduce, as it stores data in key-value format. Since SequenceFileInputFormat is available at your disposal, you don't need any custom input formats for processing the same using map reduce. It is a cleaner and better approach compared to just appending small xml file contents into a big file.

On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> Thanks. I was planning to use pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with a sequence file?
>
> This text file that I was referring to would be in hdfs itself. Is it still different than using a sequence file?
> [earlier quoted messages snipped]
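To make the SequenceFileInputFormat point concrete, a minimal driver using the old mapred API might look like the sketch below. The class names and paths are hypothetical, and it assumes the SequenceFile was written with Text keys and Text values; note the mapper's input types must match what was written, which also explains the "LongWritable cannot be cast to Text" error seen when TextInputFormat is left as the default:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SeqFileJob {

    // Keys and values arrive with the classes they were written with (Text, Text),
    // not the LongWritable/Text pair that the default TextInputFormat would produce.
    public static class XmlMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            out.collect(key, value); // identity map, just for illustration
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SeqFileJob.class);
        conf.setJobName("read-sequence-file");
        conf.setInputFormat(SequenceFileInputFormat.class); // the key line
        conf.setMapperClass(XmlMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```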
Re: Writing small files to one big file in hdfs
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks wrote:
> Mohit
> Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file contents as values.
>
> Regards
> Bejoy.K.S

Thanks. I was planning to use pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with a sequence file?

This text file that I was referring to would be in hdfs itself. Is it still different than using a sequence file?

[earlier quoted messages snipped]
Re: Writing small files to one big file in hdfs
Mohit
Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file contents as values.

Regards
Bejoy.K.S

On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia wrote:
> We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure of is whether it's ok to append one file at a time to this hdfs file.
>
> Could someone suggest if this is ok? I would like to know how others do it.
Re: Writing small files to one big file in hdfs
I'd recommend making a SequenceFile[1] to store each XML file as a value.

-Joey

[1] http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/SequenceFile.html

On Tue, Feb 21, 2012 at 12:15 PM, Mohit Anchlia wrote:
> We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure of is whether it's ok to append one file at a time to this hdfs file.
>
> Could someone suggest if this is ok? I would like to know how others do it.

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
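One way to build such a SequenceFile, sketched under the assumption of Text keys (the small file's name) and Text values (its whole contents); the class name and paths are hypothetical, and each file is read fully into memory, which is fine only because the files are small:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path xmlDir = new Path(args[0]);   // directory of small XML files
        Path seqFile = new Path(args[1]);  // output SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqFile, Text.class, Text.class,
                SequenceFile.CompressionType.RECORD);
        try {
            for (FileStatus stat : fs.listStatus(xmlDir)) {
                // Slurp the whole small file as one value
                byte[] contents = new byte[(int) stat.getLen()];
                FSDataInputStream in = fs.open(stat.getPath());
                in.readFully(contents);
                in.close();
                writer.append(new Text(stat.getPath().getName()), new Text(contents));
            }
        } finally {
            writer.close();
        }
    }
}
```

The resulting file can then be fed to a MapReduce job with SequenceFileInputFormat, as discussed earlier in the thread.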