Ted,

You may want to consider LZO compression, which (once the file is
indexed) allows a compressed file to be split across map tasks.  Gzip,
on the other hand, is not splittable, so a single .gz input file is
processed by one map task.
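
To make the gzip point concrete, here is a rough sketch of my own (plain
java.util.zip, not Hadoop code; the class and method names are made up
for illustration).  It compresses some sample records and then tries to
decompress from the start of the stream and from the middle.  A map task
handed the second half of a .gz file would be in the second situation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: shows that a gzip stream cannot be picked up mid-file.
public class GzipSplitDemo {

    // Sample input: records separated by blank lines (an assumption, to mirror the use case).
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("record ").append(i).append("\n\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress a byte array into a single gzip stream.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Try to decompress starting at the given byte offset of the gzip data.
    // Returns true only if the stream decodes cleanly all the way to its end.
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Drain the stream; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header or corrupt deflate data ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleData());
        System.out.println("decode from start:  " + canDecodeFrom(gz, 0));
        System.out.println("decode from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Decoding from the middle fails because the reader finds neither a gzip
header nor a usable starting point in the compressed data.  Indexed LZO
avoids this by recording where its compressed blocks begin.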

Check out these links.

http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
http://wiki.apache.org/hadoop/UsingLzoCompression


On Fri, Jan 8, 2010 at 1:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> The input file is in .gz format
> FYI
>
> On Fri, Jan 8, 2010 at 11:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > My current project processes input file of size 333302161 bytes.
> > What I plan to do is to split the file into equal size pieces (and on
> > blank line boundary) to improve performance.
> >
> > I found 12 classes in 0.20.1 source code which implement InputSplit.
> >
> > If someone has written code similar to what I plan to do, please share
> > some hint.
> >
> > Thanks
> >
> >
> > On Fri, Jan 8, 2010 at 2:27 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
> >
> >> Hi,
> >> The deprecation is due to the new evolving mapreduce ( o.a.h.mapreduce )
> >> APIs. Old APIs are supported for available distributions. The equivalent
> >> of TextInputFormat is available in new API :
> >>
> >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
> >>
> >> Thanks,
> >> Amogh
> >>
> >>
> >> On 1/8/10 3:47 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
> >>
> >> According to:
> >>
> >> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html#isSplitable%28org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path%29
> >>
> >> isSplitable() is deprecated.
> >>
> >> Which method should I use to replace it ?
> >>
> >> Thanks
> >>
> >>
> >
>
