Yaron,

That would certainly seem to be the easy way out; the only downside
is that you'd have to cache your values in memory.
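
A rough sketch of that buffer-then-decide shape, in plain Java so it
runs without a cluster. The class BufferThenMap and the helper longest
are made-up names for illustration: the Iterator stands in for looping
over context.nextKeyValue(), and the Consumer stands in for map(). In a
real Mapper.run override you would also need to copy each value before
buffering it, since Hadoop reuses the same Writable instance between
records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical stand-in for an overridden Mapper.run: not Hadoop code.
final class BufferThenMap {

    // Buffer every record first, let `select` choose with the whole
    // split visible, then hand only the chosen records to `map`.
    static <R> void run(Iterator<R> records,
                        Function<List<R>, List<R>> select,
                        Consumer<R> map) {
        List<R> buffered = new ArrayList<>();
        while (records.hasNext()) {
            buffered.add(records.next()); // this is the memory cost
        }
        for (R record : select.apply(buffered)) {
            map.accept(record);
        }
    }

    // Example selection rule (an assumption): keep only the longest lines.
    static List<String> longest(List<String> all) {
        int max = 0;
        for (String s : all) {
            max = Math.max(max, s.length());
        }
        List<String> out = new ArrayList<>();
        for (String s : all) {
            if (s.length() == max) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bbb", "cc", "ddd");
        // Prints "bbb" and then "ddd", one per line.
        run(lines.iterator(), BufferThenMap::longest, System.out::println);
    }
}
```

The whole split sits in the buffered list until the selection is done,
which is the memory cost mentioned above.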

If you plug in deeper down, at the RecordReader level (which provides
the specific nextKV(…) methods), you can perhaps keep just a list of
offsets of all successful line matches and re-read the whole split in
the second run. This would cost you slightly more I/O, since you seek
through the split once again, but the benefit would be lower memory
consumption -- if that is a concern here.
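
To make the offsets idea concrete, here is a minimal sketch in plain
Java I/O rather than a real RecordReader. The class OffsetTwoPass and
its method names are invented for illustration. It assumes the byte
range starts on a line boundary and that lines are UTF-8, and it skips
the split-boundary handling an actual line reader has to do.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of the two-pass idea: not Hadoop code.
final class OffsetTwoPass {

    // Pass 1: scan [start, end) and remember only where matching lines begin.
    static List<Long> findMatchOffsets(RandomAccessFile file, long start, long end,
                                       Predicate<String> matches) throws IOException {
        List<Long> offsets = new ArrayList<>();
        file.seek(start);
        while (file.getFilePointer() < end) {
            long lineStart = file.getFilePointer();
            String line = readUtf8Line(file);
            if (line == null) {
                break; // end of file
            }
            if (matches.test(line)) {
                offsets.add(lineStart);
            }
        }
        return offsets;
    }

    // Pass 2: seek back to each remembered offset and re-read that line.
    static List<String> rereadLines(RandomAccessFile file, List<Long> offsets)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (long offset : offsets) {
            file.seek(offset);
            lines.add(readUtf8Line(file));
        }
        return lines;
    }

    // RandomAccessFile.readLine maps each byte to one char, so the raw
    // bytes are recovered via ISO-8859-1 and then decoded as UTF-8.
    private static String readUtf8Line(RandomAccessFile file) throws IOException {
        String raw = file.readLine();
        if (raw == null) {
            return null;
        }
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("split", ".txt");
        Files.write(path, "apple\nbanana\navocado\ncherry\n".getBytes(StandardCharsets.UTF_8));
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            List<Long> offsets =
                findMatchOffsets(file, 0, file.length(), line -> line.startsWith("a"));
            // Prints: [0, 13] [apple, avocado]
            System.out.println(offsets + " " + rereadLines(file, offsets));
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Only one long per match is held between the two passes, so memory
scales with the number of matches rather than with the size of the
split. RandomAccessFile.readLine is unbuffered, so treat this as an
illustration of the bookkeeping, not of efficient reading.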

[Or go the longer way, and use the Reducer phase!]

On Wed, Oct 12, 2011 at 5:14 PM, Yaron Gonen <yaron.go...@gmail.com> wrote:
> Thanks for the fast reply!
> I've dug into the code a little bit, and it seems to me that I can achieve my
> goal by overriding the Mapper.run method: just iterate over the whole split by
> using context.nextKeyValue() and then call map only with the values I need.
> Since I'm a novice Hadooper, am I thinking about it the wrong way?
>
> thanks again,
> yaron
>
> On Wed, Oct 12, 2011 at 12:44 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hello Yaron,
>>
>> Yes, this is possible to do.
>>
>> You need to plug your own RecordReader implementation into the job,
>> to control what gets emitted and what is done before key-value pair
>> data is fed into map(…).
>>
>> On Wed, Oct 12, 2011 at 2:42 PM, Yaron Gonen <yaron.go...@gmail.com>
>> wrote:
>> > Hi,
>> > The map method in the Mapper gets as a parameter a single line from the
>> > split. Is there a way for Mappers to get the whole split as input?
>> > I'd like to scan the whole split before I decide which key-value pairs
>> > to
>> > emit to the reducer.
>> > Thanks
>> > yaron
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
