Ted,

You may want to consider LZO compression, which allows a compressed file to be split across map tasks. gzip, on the other hand, is not splittable.
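To make the gzip point concrete, here is a small sketch in plain Python (not Hadoop code). The sample records and the helper name try_decompress_from are made up for illustration. It shows that a gzip stream decodes from byte 0 but not from an arbitrary offset in the middle, which is why a .gz input normally ends up in a single map task.

```python
import gzip
import zlib

# Hypothetical sample data: records separated by blank lines.
records = [f"record {i}\nline two of record {i}\n" for i in range(2000)]
data = "\n".join(records).encode("utf-8")

compressed = gzip.compress(data)


def try_decompress_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip stream can be decoded starting at `offset`."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(blob[offset:])
        return True
    except zlib.error:
        return False


# Reading from the beginning works; the header is where zlib expects it.
from_start = try_decompress_from(compressed, 0)

# Starting mid-stream fails: there is no header and no way to resync.
from_middle = try_decompress_from(compressed, len(compressed) // 2)

print("decodes from start: ", from_start)
print("decodes from middle:", from_middle)
```

A reader that starts in the middle has neither the gzip header nor the compressor state built up from the earlier bytes, so it cannot pick up at a record boundary the way it can with uncompressed text.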
Check out these links:
http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
http://wiki.apache.org/hadoop/UsingLzoCompression

On Fri, Jan 8, 2010 at 1:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> The input file is in .gz format
> FYI
>
> On Fri, Jan 8, 2010 at 11:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > My current project processes input file of size 333302161 bytes.
> > What I plan to do is to split the file into equal size pieces (and on
> > blank line boundary) to improve performance.
> >
> > I found 12 classes in 0.20.1 source code which implement InputSplit.
> >
> > If someone has written code similar to what I plan to do, please share
> > some hint.
> >
> > Thanks
> >
> >
> > On Fri, Jan 8, 2010 at 2:27 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
> >
> >> Hi,
> >> The deprecation is due to the new evolving mapreduce ( o.a.h.mapreduce )
> >> APIs. Old APIs are supported for available distributions. The equivalent
> >> of TextInputFormat is available in new API :
> >>
> >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
> >>
> >> Thanks,
> >> Amogh
> >>
> >>
> >> On 1/8/10 3:47 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
> >>
> >> According to:
> >>
> >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html#isSplitable%28org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path%29
> >>
> >> isSplitable() is deprecated.
> >>
> >> Which method should I use to replace it ?
> >>
> >> Thanks
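For the plan quoted above (roughly equal pieces, cut only on blank-line boundaries), the logic could be sketched as follows. This is plain Python, not a Hadoop InputSplit, and it assumes the data has already been decompressed and fits in memory. The function names and the sample records are hypothetical.

```python
from typing import List


def split_offsets(data: bytes, pieces: int) -> List[int]:
    """Return cut positions so that each piece ends right after a blank line."""
    target = len(data) / pieces
    cuts = [0]
    for i in range(1, pieces):
        want = int(i * target)
        # Find the next blank line at or after the ideal cut point.
        pos = data.find(b"\n\n", max(want, cuts[-1]))
        if pos == -1:
            break  # no more boundaries; the last piece takes the rest
        cut = pos + 2  # cut just after the blank line
        if cuts[-1] < cut < len(data):
            cuts.append(cut)
    cuts.append(len(data))
    return cuts


def split_on_blank_lines(data: bytes, pieces: int) -> List[bytes]:
    """Slice `data` into at most `pieces` chunks on blank-line boundaries."""
    cuts = split_offsets(data, pieces)
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical sample input: 100 two-line records separated by blank lines.
sample = "\n".join(f"id={i}\nvalue={i * i}\n" for i in range(100)).encode()
chunks = split_on_blank_lines(sample, 4)

for n, chunk in enumerate(chunks):
    print(f"piece {n}: {len(chunk)} bytes")
```

Each cut is pushed forward to the next blank line, so the pieces are only approximately equal in size, but no record is split across two pieces.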