Renato,

Here's how file reading works in Hadoop (pretty detailed, but I actually simplified a few spots in this description -- one can get pretty creative by providing your own implementations of certain interfaces).
First, a little about HDFS. A file is physically stored in HDFS as a collection of blocks. These blocks are of a configurable size, usually 64 or 128 megabytes. Each of these blocks is stored on a number of nodes -- 3 by default, but you can set the replication factor to a different value. The Hadoop NameNode is responsible for keeping track of which blocks form which files, and in what order. When you start reading a file from HDFS, the framework transparently switches between feeding you new blocks as needed, so you can just read the whole file as one long stream without worrying about which nodes the data lives on, which block to read next, etc.

Now, InputFormats. An input format is responsible for two things in Hadoop. It should:

1) Give Hadoop InputSplits, which are objects that represent part of the input you wish to read, and may provide hints as to the most desirable nodes for these splits to be mapped on.

2) Give Hadoop a RecordReader for each split. The RecordReader is the thing that actually knows how to read individual records out of a byte stream.

In the case of FileInputFormat, which is what underlies most file reading in Hadoop, including Pig's PigStorage and friends, what happens is as follows. During job creation, Hadoop asks the FileInputFormat for splits. The FileInputFormat returns a list of splits that contain the name of the file being scanned and a pair of byte offsets -- "start reading from byte x, finish at byte y." Hadoop spins up a new job and sets up a Map task for each of the splits it was given. On the mappers, the FileInputFormat is asked for a RecordReader to read this stuff. The record reader is initialized with the split currently being processed, and asked to provide the next record (a key-value pair, actually, but don't worry about that) in a loop, until the reader says there are no more records to return.

What does the reader do, in the case of a FileInputFormat split? Given a file name and the offset to start reading from, it gets a file handle using the HDFS API, seeks to the position it was asked to start reading from, and starts reading in records until it hits its end position.

I'll provide a toy example. Let's say we have a 1 gigabyte file, and we want 128 meg splits. We will get 8 of them -- 0 through 128M, 128M through 256M, etc. Since the InputSplits dictate where to stop reading, readers don't re-read the same data.

There is a problem here, in that so far we've been talking about bytes, not records. What happens if the file we are reading has a record that spans a block boundary? Record readers are responsible for making sure the same record is not read twice by different readers. What is usually done is that when a reader hits the end of its allocated split, it continues reading until it hits the end of a record. When a record reader starts reading a split (other than the very first one), it discards all data it sees at the beginning of the split until it sees the end of a record, as that bit was read by the previous reader. This is why some file formats are called "unsplittable" -- if you can't start reading at an arbitrary location of a file and safely determine where records start and end, this whole paradigm breaks.

As for the files where all of this is implemented, start by reading PigStorage on the Pig side, or InputFormat / FileInputFormat / TextInputFormat in Hadoop.
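To make the record-reader half concrete, here is a compressed sketch of roughly what a line-oriented reader does. The class name ToyLineRecordReader is made up, and I'm leaving out compression, maximum line lengths, and most error handling -- compare it against the real LineRecordReader if you want the full story. It shows both tricks described above: skipping the partial record at the start of a split, and reading past the split's end to finish the last record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Toy reader for newline-delimited records (illustrative only).
public class ToyLineRecordReader extends RecordReader<LongWritable, Text> {
  private long start, pos, end;
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    start = split.getStart();
    end = start + split.getLength();

    // Open the file through the HDFS API and jump to this split's first byte.
    // Which datanodes and blocks are involved is invisible at this level.
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream fileIn = fs.open(file);
    fileIn.seek(start);
    in = new LineReader(fileIn, conf);

    // Unless this is the very first split, whatever sits between 'start' and
    // the next newline is the tail of a record owned by the previous split,
    // so discard it -- the previous reader already consumed that record.
    if (start != 0) {
      start += in.readLine(new Text());
    }
    pos = start;
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Only *start* a record while we are inside our byte range, but once
    // started, read it to completion even if that runs past 'end' (and
    // possibly across an HDFS block boundary).
    if (pos > end) {
      return false;
    }
    key.set(pos);                       // key = byte offset of the line
    int bytesRead = in.readLine(value); // value = the line itself
    if (bytesRead == 0) {
      return false;                     // end of file
    }
    pos += bytesRead;
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() {
    return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
  @Override public void close() throws IOException { in.close(); }
}

Notice that the only boundary this code cares about is the split's byte range; HDFS block boundaries never show up, because the stream fetches the next block transparently, exactly as described above.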
D

On Sun, Mar 13, 2011 at 11:22 AM, Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com> wrote:

> Thanks for answering Daniel!
> But there are a couple of things I don't quite get. For example, you say
> that each mapper will be reading just a configured amount of rows, but
> wouldn't the mappers end up reading the whole file? And my other question
> is if this is implemented, do you know in which classes this is?
> Thanks in advance.
>
> Renato M.
>
>
> 2011/3/7 Daniel Dai <jiany...@yahoo-inc.com>
>
> > Sampling job in Pig is used in "order by" and "skewed join". It will be
> > translated to a single map-reduce job. In the map, we sample the data
> > with a configurable interval; in the reduce, we do a "group all" followed
> > by a nested foreach. Within the foreach, we do a nested sort and then
> > feed the result to a UDF ("order by" and "skewed join" use different UDFs).
> >
> > In PIG-1038, we will optimize the nested sort using Hadoop secondary sort
> > if possible. The sampling job fits the bill. So PIG-841 is fixed
> > automatically.
> >
> > Daniel
> >
> >
> > On 03/05/2011 12:54 PM, Renato Marroquín Mogrovejo wrote:
> >
> >> Hey does anybody know if PIG-841 was developed? And if it was, how is it
> >> being used by Pig?
> >> Thanks in advance.
> >>
> >> Renato M.