Thibaut_ wrote:
Hi,
As each row of my HBase table can take a lot of time to process (waiting on
answers from other hosts), I would like to create a few threads to process
that data in parallel. I would then use the last call to the map function to
wait for all threads to finish their work, and only return from that last
call once everything is done and all threads have exited.

See MapRunner up in Hadoop. It's the class that does the next, next, next on the custom HBase RecordReader. Sounds like you want your own MapRunner so you can do some handling after we've run off the end of the map's region. You can set your own MapRunner on the JobConf.
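
For illustration, here is a rough sketch of what such a MapRunnable could look like against the old org.apache.hadoop.mapred API. The class name ThreadedMapRunner and the worker-thread handoff are made up for the example; only the MapRunnable interface and the JobConf hookup come from Hadoop itself.

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

public class ThreadedMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    // Instantiate the mapper configured on the job, same as the default MapRunner does.
    this.mapper = (Mapper<K1, V1, K2, V2>)
        ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
      Reporter reporter) throws IOException {
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (input.next(key, value)) {
        // The mapper can hand each row off to a pool of worker threads here.
        mapper.map(key, value, output, reporter);
        // Fresh objects per row: the default MapRunner reuses them, which would
        // corrupt rows still sitting in a worker queue.
        key = input.createKey();
        value = input.createValue();
      }
      // We have run off the end of the region: block here until all worker
      // threads are done. The OutputCollector is still in scope at this point,
      // so late results can still be emitted.
    } finally {
      mapper.close();
    }
  }
}

You would then register it with jobConf.setMapRunnerClass(ThreadedMapRunner.class).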


How do I know when the last row is passed to my mapper function? (I'm
extending TableMap for my mapper, as is also done in the wiki examples.) I
didn't find any function to check this.

I looked at overriding TableInputFormat to catch the end row, but the end row is not inclusive, so you can't trigger your cleanup when you see the map's end row.

Another possibility would be to create more mapper jobs and let the Hadoop
framework do the processing in parallel. However, I read somewhere that each
mapper gets an entire region. In my case, the data in each row is very
small, so each mapper could get millions of rows (with the default
region/block size).
Run in parallel if you can. Yes, each map gets a region by default (though it's possible to make it so a map can have more than one region if you supply a number-of-splits < number-of-regions -- see getSplits in TableInputFormatBase).
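
If the fewer-splits route is enough, here is a minimal sketch of the job setup, assuming the 0.20-era org.apache.hadoop.hbase.mapred classes (the table name "mytable", the column "cf:", and the use of IdentityTableMap are placeholders). With the old API, JobClient passes job.getNumMapTasks() to getSplits() as the numSplits hint, so asking for fewer map tasks than regions packs several regions into each split.

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.mapred.JobConf;

public class FewSplitsJobSetup {
  public static JobConf configureJob() {
    JobConf job = new JobConf(FewSplitsJobSetup.class);
    // Placeholder table/column; swap your own TableMap implementation in for IdentityTableMap.
    TableMapReduceUtil.initTableMapJob("mytable", "cf:", IdentityTableMap.class,
        ImmutableBytesWritable.class, RowResult.class, job);
    // numSplits hint handed to TableInputFormatBase.getSplits(); fewer map tasks
    // than regions means each map task covers more than one region.
    job.setNumMapTasks(4);
    return job;
  }
}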

Do you have many regions? Do you want to make it so you can split a region into pieces? If so, override TableInputFormat or TableInputFormatBase and do your own getSplits implementation.
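
For the split-a-region-into-pieces route, here is a rough sketch of a getSplits override. It assumes the 0.20-era org.apache.hadoop.hbase.mapred.TableSplit with a (tableName, startRow, endRow, location) constructor plus HBase's Bytes.split() helper; check TableInputFormatBase in your own version first, since these details differ between releases.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.mapred.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class SubdividingTableInputFormat extends TableInputFormat {
  private static final int PIECES_PER_REGION = 4;  // arbitrary example value

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] perRegion = super.getSplits(job, numSplits);
    List<InputSplit> finer = new ArrayList<InputSplit>();
    for (InputSplit split : perRegion) {
      TableSplit ts = (TableSplit) split;
      byte[] start = ts.getStartRow();
      byte[] end = ts.getEndRow();
      // The first and last regions have empty boundary keys; leave those whole
      // rather than trying to compute intermediate keys for an open-ended range.
      byte[][] keys = (start.length == 0 || end.length == 0)
          ? null : Bytes.split(start, end, PIECES_PER_REGION - 1);
      if (keys == null) {
        finer.add(ts);
        continue;
      }
      // Bytes.split returns the boundaries including start and end, so each
      // adjacent pair of keys becomes one sub-split of the region.
      for (int i = 0; i < keys.length - 1; i++) {
        finer.add(new TableSplit(ts.getTableName(), keys[i], keys[i + 1],
            ts.getRegionLocation()));
      }
    }
    return finer.toArray(new InputSplit[finer.size()]);
  }
}

Set it on the job with job.setInputFormat(SubdividingTableInputFormat.class) in place of the stock TableInputFormat.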

St.Ack


What would you do?

Thanks,
Thibaut
