Even for a single machine (and there may be reasons to use a single machine if the original data is not splittable), our experience suggests it should take about an hour to process 32 GB, which leads me to wonder whether writing the SequenceFile is your limiting step. Consider a very simple job that writes 32 GB of random data - say a Long count and a random double - to a SequenceFile, and run it on one box (you might also try the same steps without the write) to see if you are really being limited by the write. You might also consider compression while writing the SequenceFile.
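Something like the rough sketch below would give you that baseline - it uses the classic SequenceFile.createWriter API with LongWritable/DoubleWritable and block compression; the output path and record count are just placeholders, so adjust them for your setup:

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;

    public class SequenceFileWriteBenchmark {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/seqfile-write-benchmark");  // placeholder path

        // 8-byte key + 8-byte value per record; ~2 billion records is ~32 GB raw.
        long records = 2L * 1000 * 1000 * 1000;
        Random random = new Random();
        LongWritable key = new LongWritable();
        DoubleWritable value = new DoubleWritable();

        // BLOCK compression is usually the cheapest way to cut the bytes hitting HDFS.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, DoubleWritable.class, CompressionType.BLOCK);

        long start = System.currentTimeMillis();
        for (long i = 0; i < records; i++) {
          key.set(i);                       // the "Long count"
          value.set(random.nextDouble());   // the random double
          writer.append(key, value);
        }
        writer.close();

        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Wrote " + records + " records in " + (elapsed / 1000) + " s");
      }
    }

Comparing the elapsed time with and without the writer.append() call tells you how much of your 8 hours is really the write itself versus reading the small input files.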
2011/5/12 丛林 <congli...@gmail.com>

> Dear Harsh,
>
> Will you please explain how to create a sequence file in the way of
> mapreduce?
>
> Suppose that all 32G little file stored in one PC.
>
> Thanks for your suggestion.
>
> BTW: I notice that you repeated most of the topic of sequence file in
> this mail-list :-)
>
> Best Wishes,
>
> -Lin
>
>
> 2011/5/12 Harsh J <ha...@cloudera.com>:
> > Are you doing this as a MapReduce job or is it a simple linear
> > program? MapReduce could be much faster (Combined-files input format,
> > with a few Reducers for merging if you need that as well).
> >
> > On Thu, May 12, 2011 at 5:18 AM, 丛林 <congli...@gmail.com> wrote:
> >> Hi, all.
> >>
> >> I want to write lots of little files (32GB) to HDFS as
> >> org.apache.hadoop.io.SequenceFile.
> >>
> >> But now it is too slow: we use about 8 hours to create this
> >> SequenceFile (6.7GB).
> >>
> >> So I wonder how to create this SequenceFile more faster?
> >>
> >> Thanks for your suggestion.
> >>
> >> -Best Wishes,
> >>
> >> -Lin
> >>
> >
> >
> > --
> > Harsh J
>

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com