Thanks Aaron. I guess I have to adjust the format of my input file to make it 
more suitable for MapReduce processing.




________________________________
From: Aaron Kimball <[email protected]>
To: [email protected]
Sent: Monday, June 1, 2009 8:35:09 PM
Subject: Re: what is the efficient way to implement InputFormat

The major problem with this model is that an InputSplit is the unit of work
that comprises a map task. So no map tasks can be created before you know
how many InputSplits there are, and what their nature is. This means that
the creation of input splits is performed on the client. This is inherently
single-threaded (or at least, single-node), and if the client machine is
physically far from the compute cluster, it requires shipping a lot of data
to a single node. Given that MapReduce is designed to allow
for reading files in parallel via many data-local map tasks, this is
something of an anti-pattern. You want to delay reading the data files until
after the task starts.
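
To make that concrete, here is a minimal sketch (old "mapred" API, made-up
class name) of a getSplits() that looks only at file metadata, lengths and
byte offsets, and never opens the data on the client; the real reading is
deferred to the record reader running inside each task:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: splits are built from file metadata (lengths and byte
// offsets), so the client never touches the data itself.
public class OffsetOnlyInputFormat extends FileInputFormat<LongWritable, Text> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus stat : listStatus(job)) {      // metadata lookup only
      Path path = stat.getPath();
      long length = stat.getLen();
      long chunk = Math.max(1, length / Math.max(1, numSplits));
      for (long start = 0; start < length; start += chunk) {
        long len = Math.min(chunk, length - start);
        splits.add(new FileSplit(path, start, len, new String[0]));
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    // The actual reading (and record alignment) happens here, inside the
    // task, not on the client.
    return new LineRecordReader(job, (FileSplit) split);
  }
}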

To read through your data in a high-performance way, you're going to have to
rethink how you compute the input work units. Small tweaks to your data
format may be all you need.

If one of your two files has lines or other delimited records which
reference the other file, maybe you can use the regular TextInputFormat on
the first file, and then just do HDFS reads on the second file from within
your mappers.
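
As a rough sketch of that idea (the property name, field layout, and class
names below are made up), the mapper could load the second file from HDFS in
configure() and consult it for each record of the first file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: the first file arrives line by line via TextInputFormat;
// the second file is read directly from HDFS inside the task.
public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private Map<String, String> lookup = new HashMap<String, String>();

  public void configure(JobConf job) {
    try {
      // "second.file.path" is a job property you would set yourself.
      Path second = new Path(job.get("second.file.path"));
      FileSystem fs = second.getFileSystem(job);
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(second)));
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(" ", 2);   // assumed "key rest" layout
        lookup.put(fields[0], fields.length > 1 ? fields[1] : "");
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load second file", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split(" ");
    String ref = fields[0];                     // assumed cross-reference field
    String extra = lookup.get(ref);
    if (extra != null) {
      output.collect(new Text(ref), new Text(extra));
    }
  }
}

This version assumes the second file is small enough to hold in memory; if
it isn't, you would keep an open FSDataInputStream instead and seek() to
whatever region the current record points at.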

Remember, the TextInputFormat doesn't magically know where newlines are. It
accepts that there's fuzziness. It creates splits based purely on block
boundaries, but the borders are a bit elastic. If a line continues past the
end of a block boundary, the previous task will keep reading to the end of
the line. And the next task seeks to the block boundary, but then doesn't
use any text until it finds a newline and has "aligned" on a new record.

If your data format can be modified so that you can put in some sort of
record delimiters in one of the files, then you can leverage similar
behavior in your InputFormat.
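
Here is a minimal sketch of that alignment trick (made-up class name, old
"mapred" API, and '\n' standing in for whatever delimiter your format uses);
it mirrors what LineRecordReader does for newlines:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

// Sketch only: seek to the split's byte offset, skip forward to the next
// record delimiter, then keep emitting records until the first delimiter
// past the split's end.
public class DelimitedRecordReader implements RecordReader<LongWritable, Text> {

  private final FSDataInputStream in;
  private final long start;
  private final long end;
  private long pos;

  public DelimitedRecordReader(JobConf job, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(job);
    in = fs.open(split.getPath());
    start = split.getStart();
    end = start + split.getLength();
    in.seek(start);
    pos = start;
    if (start != 0) {
      // Not the first split: discard the partial record; the previous
      // task reads it to completion.
      readToDelimiter(null);
    }
  }

  // Reads up to the next delimiter; appends bytes to sb if it is non-null.
  // Byte-at-a-time and single-byte characters assumed, for clarity only.
  private int readToDelimiter(StringBuilder sb) throws IOException {
    int b;
    while ((b = in.read()) != -1) {
      pos++;
      if (b == '\n') break;            // delimiter byte (assumption)
      if (sb != null) sb.append((char) b);
    }
    return b;
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    if (pos > end) return false;       // record starts past our split; stop
    key.set(pos);
    StringBuilder sb = new StringBuilder();
    int last = readToDelimiter(sb);
    if (last == -1 && sb.length() == 0) return false;
    value.set(sb.toString());
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return pos; }
  public float getProgress() {
    if (end == start) return 1.0f;
    return Math.min(1.0f, (pos - start) / (float) (end - start));
  }
  public void close() throws IOException { in.close(); }
}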

Or else, if you know how many splits you are going to have ahead of time
(even without knowing their exact contents), you can just write out a file
that contains the numbers 1 ... N, where N is the number of splits you have.
Then use NLineInputFormat to read this file. Each mapper gets its "split id"
as its input record; you then use the FileSystem interface to access the
i'th split of data from within map task i (0 < i <= N).
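
A rough sketch of that approach (the index-file path and class names are
made up; what you do with the split id inside the mapper depends entirely on
your format):

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SplitIdDriver {

  // Write a tiny "index" file with one split id per line, then run a job
  // whose official input is that index file.
  public static void run(JobConf job, int numSplits) throws IOException {
    Path index = new Path("/tmp/split-ids.txt");       // hypothetical path
    FileSystem fs = index.getFileSystem(job);
    PrintWriter out =
        new PrintWriter(new OutputStreamWriter(fs.create(index, true)));
    for (int i = 1; i <= numSplits; i++) {
      out.println(i);
    }
    out.close();

    job.setInputFormat(NLineInputFormat.class);
    job.setInt("mapred.line.input.format.linespermap", 1);  // one id per map
    FileInputFormat.setInputPaths(job, index);
    job.setMapperClass(SplitIdMapper.class);
    JobClient.runJob(job);
  }

  public static class SplitIdMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf conf;

    public void configure(JobConf job) { this.conf = job; }

    public void map(LongWritable offset, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      int splitId = Integer.parseInt(value.toString().trim());
      // Use the FileSystem API to read the splitId'th chunk of your real
      // data files; the details are entirely up to your format.
      FileSystem fs = FileSystem.get(conf);
      // ... fs.open(...), seek to and read the part belonging to splitId ...
    }
  }
}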


I hope these suggestions help.
Good luck,
- Aaron



On Mon, Jun 1, 2009 at 4:57 PM, Kunsheng Chen <[email protected]> wrote:

>
> Hi Sun,
>
> Sounds like you are dealing with something I met before.
>
> The format of my file is something like this; there is only a single space
> between each element:
>
> source destination degree timestamp.
>
> for example:
>
> http://www.google.com http://something.net 3 1234555
>
>
> The source is my key for map and reduce.
>
> I just use   'String [] splits = value.toString().split(" ");' to split
> everything.
>
>
> Maybe you are looking for something more complicated, hope this helps.
>
>
>
>
>
> --- On Mon, 6/1/09, Zhengguo 'Mike' SUN <[email protected]> wrote:
>
> > From: Zhengguo 'Mike' SUN <[email protected]>
> > Subject: what is the efficient way to implement InputFormat
> > To: [email protected]
> > Date: Monday, June 1, 2009, 8:52 PM
> > Hi, All,
> >
> > The input of my MapReduce job is two large txt files, and
> > an InputSplit consists of a portion of both files. This
> > split is content-dependent, so I have to read the input
> > files to generate a split. Now the thing is that most of
> > the time is spent in generating these splits; the Map and
> > Reduce phases actually take less time than that. I was
> > wondering if there is an efficient way to generate splits
> > from files. My InputFormat class is based on
> > FileInputFormat. The getSplits function of FileInputFormat
> > doesn't read the input file, but that is impossible for me
> > because my split depends on the content of the file.
> >
> > Any ideas or comments are appreciated.
> >
> >
> >
>
>
>
>



      
