Find the MR task that failed. Click through the UI to look at its
logs. They may have interesting info. It's probably complaining about
a region not being available (a NotServingRegionException, or NSRE).
Figure out which region it is. Use the region historian, or grep for
the region name in the master logs, filtering out the metaScanner
noise, to see if you can piece together the region's history around
the failure. Look too at the loading around failure time. Were you
swapping, etc.? (Ganglia or some such helps here.)
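For example, something like the below (the log path and region name
here are made up; substitute your own -- region names look like
TABLENAME,STARTKEY,TIMESTAMP):

  grep 'TABLENAME,startrow,1234567890' logs/hbase-master.log | grep -v metaScanner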
You might also test that the table is still wholesome -- that the MR
job didn't damage it. A quick check that all regions are online and
accessible is to scan for a column whose column family exists but
whose qualifier you know is not present: e.g. if you have the column
family 'page' and you know there is no column 'page:xyz', scan with
that (enable DEBUG in log4j so you can see regions being loaded as
the scan progresses): "scan 'TABLENAME', ['page:xyz']".
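If you want that DEBUG output, one way -- assuming the stock HBase
log4j setup -- is a line like the below in conf/log4j.properties:

  log4j.logger.org.apache.hadoop.hbase=DEBUG

Re-run the shell after that and the region lookups should show in the
client log as the scan moves across the table.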
You might need to up the timeouts/retries.
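If you go that route, the client knobs live in hbase-site.xml; e.g.
something like the below (the values are illustrative, not
recommendations -- upping hbase.client.retries.number past its
default should get you beyond the 10 attempts you're seeing):

  <property>
    <name>hbase.client.retries.number</name>
    <value>20</value>
  </property>
  <property>
    <name>hbase.client.pause</name>
    <value>2000</value>
  </property>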
St.Ack
Dru Jensen wrote:
Hi hbase-users,
During a fairly large MR process, on the Reduce cycle as it's writing
its results to a table, I see
org.apache.hadoop.hbase.NotServingRegionException in the region server
log several times, and then I see a split reporting it was successful.
Eventually, the Reduce process fails with
org.apache.hadoop.hbase.client.RetriesExhaustedException after 10
failed attempts.
What can I do to fix it?
Thanks,
Dru