Find the MR task that failed. Click through the UI to look at its logs. They may have interesting info. It's probably complaining about a region not being available (NSRE). Figure out which region it is. Use the region historian, or grep the master logs for the region name -- piping through 'grep -v metaScanner' so you avoid the metaScanner noise -- to see if you can piece together the region's history around the failure. Look too at the loading around failure time. Were you swapping, etc.? (Ganglia or some such helps here.)
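
For example, a minimal grep along those lines -- assuming the master log lives at /var/log/hbase/hbase-master.log (adjust for your install) and REGIONNAME stands in for the actual region name from the failed task's logs:

  grep REGIONNAME /var/log/hbase/hbase-master.log | grep -v metaScanner

That should leave the open/close/split events for the region without the periodic metaScanner chatter.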

You might also test that the table is still wholesome -- that the MR job didn't damage it. A quick check that all regions are online and accessible is to scan for a column whose column family does exist but whose qualifier you know is not present: e.g., if you have the column family 'page' and you know there is no column 'page:xyz', scan with that (enable DEBUG in log4j so you can see regions being loaded as the scan progresses): "scan 'TABLENAME', ['page:xyz']".
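
Concretely -- this is a sketch, and the log4j.properties location depends on your install -- you'd bump the HBase logger to DEBUG and then run the scan from the shell:

  # in conf/log4j.properties
  log4j.logger.org.apache.hadoop.hbase=DEBUG

  # then, in the hbase shell
  scan 'TABLENAME', ['page:xyz']

With DEBUG on you should see each region logged as the scanner walks the table; a region that never shows up, or where the scan hangs, is your suspect.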

You might also need to up the client timeouts/retries so writes ride out the splits.
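
If you go that route, the relevant knobs are hbase.client.retries.number and hbase.client.pause in hbase-site.xml. A sketch, with illustrative values only (not recommendations):

  <property>
    <name>hbase.client.retries.number</name>
    <value>20</value>
  </property>
  <property>
    <name>hbase.client.pause</name>
    <value>2000</value>
  </property>

More retries with a longer pause between them gives a splitting region time to come back online before the client throws RetriesExhaustedException.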
St.Ack


Dru Jensen wrote:
Hi hbase-users,

During a fairly large MR process, on the Reduce cycle as it's writing its results to a table, I see org.apache.hadoop.hbase.NotServingRegionException in the region server log several times, and then I see a split reporting it was successful.

Eventually, the Reduce process fails with org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 failed attempts.

What can I do to fix it?

Thanks,
Dru




