Hi,
   I also have a similar kind of question.

Is it possible for a job to start reading a file (i.e., start a split) from a specific position in the file rather than from the beginning? The idea is that I have some information in a file whose first part can only be read sequentially, not in parallel, so I read that data with isSplitable = false. Once I have that data, the next part of the search can be done in parallel, so it would be better to start searching from that point onward with multiple mappers (isSplitable = true).
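A rough sketch of what I mean, using the old org.apache.hadoop.mapred API. I'm assuming here that the sequential header region has a known byte length (HEADER_BYTES below is a made-up constant; in practice you would record the offset after the first, non-parallel pass). The idea is to drop splits that fall entirely inside the header and trim the first remaining split so all mappers start past it:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SkipHeaderInputFormat extends TextInputFormat {

  // Hypothetical: byte offset where the parallel region begins. In
  // practice this would be recorded by the sequential first pass.
  private static final long HEADER_BYTES = 4096;

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits)
      throws IOException {
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit raw : super.getSplits(job, numSplits)) {
      FileSplit split = (FileSplit) raw;
      long start = split.getStart();
      long end = start + split.getLength();
      if (end <= HEADER_BYTES) {
        continue; // split lies entirely inside the header: drop it
      }
      if (start < HEADER_BYTES) {
        // Trim the split so it begins just past the header region.
        split = new FileSplit(split.getPath(), HEADER_BYTES,
            end - HEADER_BYTES, split.getLocations());
      }
      kept.add(split);
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}

One caveat: LineRecordReader treats any split with a non-zero start offset as beginning mid-record and skips ahead past the next newline, so HEADER_BYTES should point at the newline that ends the header, otherwise the first parallel record gets silently dropped.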

BR,
Bari

--------------------------------------------------
From: "Amandeep Khurana" <[email protected]>
Sent: Thursday, September 10, 2009 9:49 AM
To: <[email protected]>
Subject: Re: Reading a subset of records from hdfs

Why not just use a higher number of mappers? Why split the work into
multiple jobs? Is there a particular case where you think this would be useful?
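If the goal is just fixed-size batches, something like NLineInputFormat might already do it. A minimal sketch (old org.apache.hadoop.mapred API; the input/output paths are placeholders and the mapper/reducer classes are omitted, so the identity defaults apply):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class BatchedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BatchedJob.class);
    conf.setJobName("100-line-batches");

    // NLineInputFormat hands each map task exactly N input lines.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 100);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Each map task then processes exactly one 100-record batch, and a failed task only re-runs its own batch, so there is no need to chain a separate job per line range.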

On 9/9/09, Rakhi Khatwani <[email protected]> wrote:
Hi,
Suppose I have an HDFS file with 10,000 entries, and I want my job to
process 100 records at a time (to minimize loss of data during job
crashes, network errors, etc.). If a job can read a subset of records from
a file in HDFS, I can combine this with chaining to achieve my objective: for
example, job 1 reads lines 1-100 of the input from HDFS, job 2 reads
lines 101-200, and so on.
Is there a way to configure a job to read only a subset of
records from a file in HDFS?
Regards,
Raakhi



--
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
