Great info :)

Sent from my iPhone
On Mar 21, 2012, at 9:10 AM, Jane Wayne <[email protected]> wrote:

> if anyone is facing the same problem, here's what i did. i took anil's
> advice to use NLineInputFormat (because that approach would scale out my
> mappers).
>
> however, i am using the new mapreduce package/API in hadoop v0.20.2. i
> noticed that you cannot use NLineInputFormat from the old package/API
> (mapred).
>
> when i took a look at hadoop v1.0.1, there is an NLineInputFormat class for
> the new API. i simply copied and pasted this file into my project. i got 4
> errors associated with import statements and annotations. when i removed
> the 2 import statements and the corresponding 2 annotations, the class compiled
> successfully. after this modification, the NLineInputFormat from v1.0.1
> works on a cluster running v0.20.2.
>
> one mini-problem solved, many more to go.
>
> thanks for the help.
>
> On Wed, Mar 21, 2012 at 3:33 AM, Jane Wayne <[email protected]> wrote:
>
>> as i understand it, that class does not exist for the new API in hadoop v0.20.2
>> (which is what i am using). if i am mistaken, where is it?
>>
>> i am looking at hadoop v1.0.1, and there is an NLineInputFormat class. i
>> wonder if i can simply copy/paste it into my project.
>>
>> On Wed, Mar 21, 2012 at 2:37 AM, Anil Gupta <[email protected]> wrote:
>>
>>> Have a look at the NLineInputFormat class in Hadoop. That class will solve
>>> your purpose.
>>>
>>> Best Regards,
>>> Anil
>>>
>>> On Mar 20, 2012, at 11:07 PM, Jane Wayne <[email protected]> wrote:
>>>
>>>> i have a matrix that i am performing operations on. it is 10,000 rows by
>>>> 5,000 columns. the total size of the file is just under 30 MB. my HDFS
>>>> block size is set to 64 MB. from what i understand, the number of mappers
>>>> is roughly equal to the number of HDFS blocks used in the input, i.e. if my
>>>> input data spans 1 block, then only 1 mapper is created; if my data spans 2
>>>> blocks, then 2 mappers will be created, etc.
>>>>
>>>> so, my one matrix file of roughly 30 MB won't fill up a block, and as such,
>>>> only 1 mapper will be called upon the data. is this understanding correct?
>>>>
>>>> if so, what i want is for more than one mapper (let's say 10) to work on
>>>> the data, even though it resides in 1 block. my analysis (or map/reduce job)
>>>> is such that 2 or more mappers can work on different parts of the matrix.
>>>> for example, mapper 1 can work on the first 500 rows, mapper 2 can work on
>>>> the next 500 rows, etc. how can i set up multiple mappers to work on a file
>>>> that resides on only one block (or a file whose size is smaller than the
>>>> HDFS block size)?
>>>>
>>>> can i split the matrix into (let's say) 10 files? that would mean 30 MB / 10
>>>> = 3 MB per file, and then i would put each 3 MB file onto HDFS. would this
>>>> increase the chance of having multiple mappers work simultaneously on the
>>>> data/matrix? if i can increase the number of mappers, i think (pretty sure)
>>>> my implementation will improve in speed linearly.
>>>>
>>>> any help is appreciated.
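
[Editor's note] For anyone landing on this thread later, below is a minimal driver sketch showing how the back-ported NLineInputFormat might be wired into a job on the new (org.apache.hadoop.mapreduce) API, as Jane describes. The class names MatrixDriver and RowMapper, the job name, the input/output paths taken from args, and the 1,000-lines-per-split value are illustrative assumptions, not part of the original thread; the sketch also assumes the class copied from v1.0.1 keeps its setNumLinesPerSplit helper (the underlying property in that version is mapreduce.input.lineinputformat.linespermap).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// plus an import of the NLineInputFormat class copied from v1.0.1 into the
// local project (whatever package it was pasted into).

public class MatrixDriver {

  // map-only placeholder: each task sees only the N lines (matrix rows) in its split
  public static class RowMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
        throws IOException, InterruptedException {
      // the real matrix work would go here; this just echoes the row
      context.write(offset, row);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "matrix-op");   // 0.20.2-era constructor
    job.setJarByClass(MatrixDriver.class);

    job.setMapperClass(RowMapper.class);
    job.setNumReduceTasks(0);               // map-only for this sketch

    // hand each mapper a fixed number of input lines instead of a whole block;
    // 1,000 lines per split over a 10,000-row matrix gives roughly 10 map tasks
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);
    // if the copied class lacks that helper, setting the property directly
    // should have the same effect:
    // job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 1000);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point of the input format here is that it decouples split size from block size, so the ~30 MB file no longer limits the job to a single map task. Splitting the matrix into 10 separate 3 MB files (as asked earlier in the thread) would also yield 10 mappers, since FileInputFormat produces at least one split per file, but it adds bookkeeping and creates many small files, which HDFS handles less efficiently.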
