Renato,

Here's how file reading works in Hadoop (pretty detailed, but I actually simplified a few spots in this description -- one can get pretty creative by providing your own implementations of certain interfaces).
First, a little about HDFS. A file is physically stored in HDFS as a collection of blocks. These blocks are of a configurable size, usually 64 or 128 megabytes. Each of these blocks is stored on a number of nodes -- 3 by default, but you can set the replication factor to a different value. The Hadoop NameNode is responsible for keeping track of which blocks form which files, and in what order. When you start reading a file from HDFS, the framework transparently switches between feeding you new blocks as needed, so you can just read the whole file as one long stream without worrying about which nodes the data lives on, which block to read next, etc.

Now, InputFormats. An input format is responsible for two things in Hadoop. It should:

1) Give Hadoop InputSplits, which are objects that represent part of the input you wish to read, and may provide hints as to the most desirable nodes for these splits to be mapped on.

2) Give Hadoop a RecordReader for each split. The RecordReader is the thing that actually knows how to read individual records out of a byte stream.

In the case of FileInputFormat, which is what underlies most file reading in Hadoop, including Pig's PigStorage and friends, what happens is as follows. During job creation, Hadoop asks the FileInputFormat for splits. The FileInputFormat returns a list of splits that contain the name of the file being scanned and a pair of byte offsets -- "start reading from byte x, finish at byte y." Hadoop spins up a new job and sets up a Map task for each of the splits it was given. On the mappers, the FileInputFormat is asked for a RecordReader to read this stuff. The record reader is initialized with the split currently being processed, and asked to provide the next record (a key-value pair, actually, but don't worry about that) in a loop, until the reader says there are no more records to return.

What does the reader do, in the case of a FileInputFormat split? Given a file name and the offset to start reading from, it gets a file handle using the HDFS API, seeks to the position it was asked to start reading from, and starts reading in records until it hits its end position.

I'll provide a toy example. Let's say we have a 1 gigabyte file, and we want 128 meg splits. We will get 8 of them -- 0 through 128M, 128M through 256M, etc. Since the InputSplits dictate where to stop reading, readers don't re-read the same data.

There is a problem here, in that so far we've been talking about bytes, not records. What happens if the file we are reading has a record that spans a block boundary? Record readers are responsible for making sure the same record is not read twice by different readers. What is usually done is that when a reader hits the end of its allocated split, it continues reading until it hits the end of a record. When a record reader starts reading a split (other than the very first one), it discards all data it sees at the beginning of the split until it sees the end of a record, as that bit was read by the previous reader. This is why some file formats are called "unsplittable" -- if you can't start reading at an arbitrary location of a file and safely determine where records start and end, this whole paradigm breaks.

As for the files where all of this is implemented, start by reading PigStorage on the Pig side, or InputFormat / FileInputFormat / TextInputFormat in Hadoop.
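To make the record-reader half concrete, here is a compressed sketch of roughly what a line-oriented reader does. The class name ToyLineRecordReader is made up, and I'm leaving out compression, maximum line lengths, and most error handling -- compare it against the real LineRecordReader if you want the full story. It shows both tricks described above: skipping the partial record at the start of a split, and reading past the split's end to finish the last record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Toy reader for newline-delimited records (illustrative only).
public class ToyLineRecordReader extends RecordReader<LongWritable, Text> {
  private long start, pos, end;
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();

    // Open the file through the HDFS API and jump to this split's first byte.
    // Which datanodes and blocks are involved is invisible at this level.
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(file);
    fileIn.seek(start);
    in = new LineReader(fileIn, conf);

    // Unless this is the very first split, whatever sits between 'start' and
    // the next newline is the tail of a record owned by the previous split,
    // so discard it -- the previous reader already consumed that record.
    if (start != 0) {
      start += in.readLine(new Text());
    }
    pos = start;
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Only *start* a record while we are inside our byte range, but once
    // started, read it to completion even if that runs past 'end' (and
    // possibly across an HDFS block boundary).
    if (pos > end) {
      return false;
    }
    key.set(pos);                       // key = byte offset of the line
    int bytesRead = in.readLine(value); // value = the line itself
    if (bytesRead == 0) {
      return false;                     // end of file
    }
    pos += bytesRead;
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() {
    return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
  @Override public void close() throws IOException { in.close(); }
}

Notice that the only boundary this code cares about is the split's byte range; HDFS block boundaries never show up, because the stream fetches the next block transparently, exactly as described above.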
D

On Sun, Mar 13, 2011 at 11:22 AM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Thanks for answering Daniel!
> But there are a couple of things I don't quite get. For example, you say
> that each mapper will be reading just a configured amount of rows, but
> wouldn't the mappers end up reading the whole file? And my other question
> is if this is implemented, do you know in which classes this is?
> Thanks in advance.
>
> Renato M.
>
>
> 2011/3/7 Daniel Dai <jiany...@yahoo-inc.com>
>
> > Sampling job in Pig is used in "order by" and "skewed join". It will be
> > translated to a single map-reduce job. In the map, we sample the data
> > with a configurable interval; in the reduce, we do a "group all" followed
> > by a nested foreach. Within the foreach, we do a nested sort and then
> > feed the result to a UDF ("order by" and "skewed join" use different UDFs).
> >
> > In PIG-1038, we will optimize the nested sort using Hadoop secondary sort
> > if possible. The sampling job fits the bill. So PIG-841 is fixed
> > automatically.
> >
> > Daniel
> >
> >
> > On 03/05/2011 12:54 PM, Renato Marroquín Mogrovejo wrote:
> >
> >> Hey does anybody know if PIG-841 was developed? And if it was, how is it
> >> being used by Pig?
> >> Thanks in advance.
> >>
> >> Renato M.