Thibaut_ wrote:
Hi,
As each row of my HBase table can take a lot of time to process (waiting on
answers from other hosts), I would like to create a few threads to process
that data in parallel. I would then use the last call to the map function to
wait for all threads to finish their work, and only return from that last
call once everything is done and all threads have exited.

See MapRunner up in Hadoop. It's the class that does the next, next, next on the custom HBase RecordReader. Sounds like you want your own MapRunner so you can do some handling after we've run off the end of the map's region. You can set your own MapRunner on the JobConf.
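
For illustration, here is a rough sketch of what such a MapRunnable could look like against the old org.apache.hadoop.mapred API. The class name ThreadedMapRunner and the worker-thread handoff are made up for the example; only the MapRunnable interface and the JobConf hookup come from Hadoop itself.

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

public class ThreadedMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    // Instantiate the mapper configured on the job, same as the default MapRunner does.
    this.mapper = (Mapper<K1, V1, K2, V2>)
        ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
      Reporter reporter) throws IOException {
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (input.next(key, value)) {
        // The mapper can hand each row off to a pool of worker threads here.
        mapper.map(key, value, output, reporter);
        // Fresh objects per row: the default MapRunner reuses them, which would
        // corrupt rows still sitting in a worker queue.
        key = input.createKey();
        value = input.createValue();
      }
      // We have run off the end of the region: block here until all worker
      // threads are done. The OutputCollector is still in scope at this point,
      // so late results can still be emitted.
    } finally {
      mapper.close();
    }
  }
}

You would then register it with jobConf.setMapRunnerClass(ThreadedMapRunner.class).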


How do I know when the last row is passed to my mapper function? (I'm
extending TableMap for my mapper, as is also done in the wiki examples.) I
didn't find any function to check this.

I looked at overriding TableInputFormat to catch the end row, but the end row is not inclusive, so you can't trigger your cleanup when you see the map's end row.

Another possibility would be to create more mapper jobs and let the Hadoop
framework do the processing in parallel. However, I read somewhere that each
mapper gets an entire region. In my case, the data in each row is very
small, so each mapper could get millions of rows (with the default
region/block size).
Run in parallel if you can. Yes, each map gets a region by default (though it's possible to make it so a map can have more than one region if you supply a number-of-splits < number-of-regions -- see getSplits in TableInputFormatBase).
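
If the fewer-splits route is enough, here is a minimal sketch of the job setup, assuming the 0.20-era org.apache.hadoop.hbase.mapred classes (the table name "mytable", the column "cf:", and the use of IdentityTableMap are placeholders). With the old API, JobClient passes job.getNumMapTasks() to getSplits() as the numSplits hint, so asking for fewer map tasks than regions packs several regions into each split.

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.mapred.JobConf;

public class FewSplitsJobSetup {
  public static JobConf configureJob() {
    JobConf job = new JobConf(FewSplitsJobSetup.class);
    // Placeholder table/column; swap your own TableMap implementation in for IdentityTableMap.
    TableMapReduceUtil.initTableMapJob("mytable", "cf:", IdentityTableMap.class,
        ImmutableBytesWritable.class, RowResult.class, job);
    // numSplits hint handed to TableInputFormatBase.getSplits(); fewer map tasks
    // than regions means each map task covers more than one region.
    job.setNumMapTasks(4);
    return job;
  }
}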

Do you have many regions? Do you want to make it so you can split a region into pieces? If so, override TableInputFormat or TableInputFormatBase and do your own getSplits implementation.
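
For the split-a-region-into-pieces route, here is a rough sketch of a getSplits override. It assumes the 0.20-era org.apache.hadoop.hbase.mapred.TableSplit with a (tableName, startRow, endRow, location) constructor plus HBase's Bytes.split() helper; check TableInputFormatBase in your own version first, since these details differ between releases.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.hbase.mapred.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class SubdividingTableInputFormat extends TableInputFormat {
  private static final int PIECES_PER_REGION = 4;  // arbitrary example value

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] perRegion = super.getSplits(job, numSplits);
    List<InputSplit> finer = new ArrayList<InputSplit>();
    for (InputSplit split : perRegion) {
      TableSplit ts = (TableSplit) split;
      byte[] start = ts.getStartRow();
      byte[] end = ts.getEndRow();
      // The first and last regions have empty boundary keys; leave those whole
      // rather than trying to compute intermediate keys for an open-ended range.
      byte[][] keys = (start.length == 0 || end.length == 0)
          ? null : Bytes.split(start, end, PIECES_PER_REGION - 1);
      if (keys == null) {
        finer.add(ts);
        continue;
      }
      // Bytes.split returns the boundaries including start and end, so each
      // adjacent pair of keys becomes one sub-split of the region.
      for (int i = 0; i < keys.length - 1; i++) {
        finer.add(new TableSplit(ts.getTableName(), keys[i], keys[i + 1],
            ts.getRegionLocation()));
      }
    }
    return finer.toArray(new InputSplit[finer.size()]);
  }
}

Set it on the job with job.setInputFormat(SubdividingTableInputFormat.class) in place of the stock TableInputFormat.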

St.Ack


What would you do?

Thanks,
Thibaut
