Stack and J-D, Thanks for your responses.
It looks like the RetriesExhaustedException occurred during:
2008-10-23 11:08:55,180 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region ... 1224785065371 in 4mins, 25sec
It doesn't look like I am having the HBASE-921 issue (yet).
What settings can I change so that compactions do not take so long?
I found this setting:
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
  <description>
    If more than this number of HStoreFiles in any one HStore
    (one HStoreFile is written per flush of memcache) then a compaction
    is run to rewrite all HStoreFiles files as one. Larger numbers
    put off compaction but when it runs, it takes longer to complete.
    During a compaction, updates cannot be flushed to disk. Long
    compactions require memory sufficient to carry the logging of
    all updates across the duration of the compaction.
    If too large, clients timeout during compaction.
  </description>
</property>
Should I lower this or is there a better way?
Thanks,
Dru
On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:
Dru,
See also whether it's a case of HBASE-921
<https://issues.apache.org/jira/browse/HBASE-921>, which would make
sense if you are not running HBase 0.18.1 and are under heavy load.
J-D
On Thu, Oct 23, 2008 at 2:30 PM, stack <[EMAIL PROTECTED]> wrote:
Find the MR task that failed. Click through the UI to look at its logs.
It may have interesting info. It's probably complaining about a region
not being available (NSRE). Figure out which region it is. Use the
region historian or grep the master logs for the region name --
'grep REGIONNAME MASTER_LOG | grep -v metaScanner', so you avoid the
metaScanner noise -- to see if you can figure out the region's history
around the failure. Look too at the load around failure time: were you
swapping, etc.? (Ganglia or some such helps here.)
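The log search described above could be wrapped up roughly as follows. This is only a sketch: region_history is a made-up helper name, and the region name and master log path are placeholders you would have to fill in for your own cluster.

```shell
#!/bin/sh
# region_history is a hypothetical helper name, not an HBase tool.
# $1: region name to look for (treated as a fixed string, not a regex)
# $2: path to the master log file (location depends on your install)
region_history() {
  # Keep only lines mentioning the region, then drop metaScanner noise.
  grep -F -e "$1" "$2" | grep -v metaScanner
}

# Example call (placeholders, substitute your own values):
#   region_history 'REGIONNAME' /path/to/master.log
```

Matching the region name as a fixed string (grep -F) avoids surprises if the name contains characters that grep would otherwise treat as regex syntax.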
You might also test that the table is still wholesome -- that the MR job
didn't damage it. A quick check that all regions are online and
accessible is to scan for a column whose column family does exist but
whose qualifier you know is not present: e.g. if you have column family
'page' and you know there is no column 'page:xyz', scan with that
(enable DEBUG in log4j so you can see regions being loaded as the scan
progresses): "scan 'TABLENAME', ['page:xyz']".
You might need to up the timeouts/retries.
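As a rough illustration of upping the retries, an override in the client-side hbase-site.xml might look like the fragment below. It assumes the hbase.client.retries.number and hbase.client.pause properties are present in your version's hbase-default.xml (check there first), and the values shown are arbitrary examples, not recommendations.

```xml
<!-- Sketch only: confirm both property names against your hbase-default.xml. -->
<property>
  <name>hbase.client.retries.number</name>
  <!-- Example value; the failure reported in this thread came after 10 attempts. -->
  <value>20</value>
</property>
<property>
  <name>hbase.client.pause</name>
  <!-- Example wait between retries, in milliseconds. -->
  <value>10000</value>
</property>
```

The MR tasks would need to pick up this configuration for it to have any effect on the reduce-side writes.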
St.Ack
Dru Jensen wrote:
Hi hbase-users,
During a fairly large MR process, in the Reduce phase as it's writing
its results to a table, I see
org.apache.hadoop.hbase.NotServingRegionException in the region server
log several times, and then I see a split reporting that it was
successful.
Eventually, the Reduce process fails with
org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 failed
attempts.
What can I do to fix it?
Thanks,
Dru