Hi,
   I also have a similar kind of question.

Is it possible for a job to start reading a file (i.e., start a split) from a specific position in the file rather than from the beginning? The idea is that I have some information in a file whose first part can only be read sequentially, not in parallel, so I read that data with isSplitable = false. Once I have that data, the next part of the search can be done in parallel, so it would be better to start searching from that point onward with multiple mappers (isSplitable = true).
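A rough sketch of what I mean, using the old org.apache.hadoop.mapred API. I'm assuming here that the sequential header region has a known byte length (HEADER_BYTES below is a made-up constant; in practice you would record the offset after the first, non-parallel pass). The idea is to drop splits that fall entirely inside the header and trim the first remaining split so all mappers start past it:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SkipHeaderInputFormat extends TextInputFormat {

  // Hypothetical: byte offset where the parallel region begins. In
  // practice this would be recorded by the sequential first pass.
  private static final long HEADER_BYTES = 4096;

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits)
      throws IOException {
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit raw : super.getSplits(job, numSplits)) {
      FileSplit split = (FileSplit) raw;
      long start = split.getStart();
      long end = start + split.getLength();
      if (end <= HEADER_BYTES) {
        continue; // split lies entirely inside the header: drop it
      }
      if (start < HEADER_BYTES) {
        // Trim the split so it begins just past the header region.
        split = new FileSplit(split.getPath(), HEADER_BYTES,
            end - HEADER_BYTES, split.getLocations());
      }
      kept.add(split);
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}

One caveat: LineRecordReader treats any split with a non-zero start offset as beginning mid-record and skips ahead past the next newline, so HEADER_BYTES should point at the newline that ends the header, otherwise the first parallel record gets silently dropped.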

BR,
Bari

--------------------------------------------------
From: "Amandeep Khurana" <[email protected]>
Sent: Thursday, September 10, 2009 9:49 AM
To: <[email protected]>
Subject: Re: Reading a subset of records from hdfs

Why not just use a higher number of mappers? Why split the work into
multiple jobs? Is there a particular case where you think this would be useful?
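If the goal is just fixed-size batches, something like NLineInputFormat might already do it. A minimal sketch (old org.apache.hadoop.mapred API; the input/output paths are placeholders and the mapper/reducer classes are omitted, so the identity defaults apply):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class BatchedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BatchedJob.class);
    conf.setJobName("100-line-batches");

    // NLineInputFormat hands each map task exactly N input lines.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 100);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Each map task then processes exactly one 100-record batch, and a failed task only re-runs its own batch, so there is no need to chain a separate job per line range.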

On 9/9/09, Rakhi Khatwani <[email protected]> wrote:
Hi,
Suppose I have an HDFS file with 10,000 entries, and I want my job to
process 100 records at a time (to minimize loss of data during job
crashes, network errors, etc.). If a job can read a subset of records from
a file in HDFS, I can combine this with chaining to achieve my objective: for
example, job 1 reads lines 1-100 of the input from HDFS, job 2 reads
lines 101-200, and so on.
Is there a way to configure a job to read only a subset of
records from a file in HDFS?
Regards,
Raakhi



--
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
