Re: NotServingRegionException - Map/Reduce process fails

Dru Jensen Thu, 23 Oct 2008 13:10:18 -0700

I do not see any swapping. I have a 3-node cluster with 8GB memory and 4 CPUs each and 2TB HDFS. Node 1 is acting as master.

I am reducing 32M map results into about 2M rows, with several column families of tens of columns each. I am writing them to a table using the TableReduce class.

Grepping for compaction in the regionserver log, it is progressively getting longer, from seconds to minutes and then to the 4mins, 25sec before it failed the Reduce cycle and started over.
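
To make the trend in compaction times easier to see than with a plain grep, the durations can be pulled out of the log and converted to seconds. This is only a rough sketch: the line format is assumed from the single "compaction completed on region ... in 4mins, 25sec" line quoted in this thread, the "hrs" unit is a guess at how longer runs would print, and the function names are made up for illustration.

```python
import re

# Assumed duration format, based on the one log line quoted in this thread
# ("4mins, 25sec"); the "hrs" unit is a guess at how longer runs would print.
UNIT_SECONDS = {"hrs": 3600, "mins": 60, "sec": 1}
DURATION = re.compile(r"(\d+)(hrs|mins|sec)")
COMPACTION = re.compile(r"compaction completed on region .* in (.+)$")


def duration_seconds(text):
    """Convert a string such as '4mins, 25sec' to a number of seconds."""
    return sum(int(n) * UNIT_SECONDS[unit] for n, unit in DURATION.findall(text))


def compaction_times(lines):
    """Yield (timestamp, seconds) for each 'compaction completed' log line."""
    for line in lines:
        match = COMPACTION.search(line)
        if match:
            timestamp = line[:23]  # e.g. '2008-10-23 11:08:55,180'
            yield timestamp, duration_seconds(match.group(1))
```

Passing an open regionserver log file to compaction_times() and printing each pair should show whether the durations climb steadily in the run-up to the failure.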


  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>120000</value>
  </property>
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>536870912</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>1023</value>
  </property>

I have increased the region file size to 512MB to support the large rows.

I was hitting the same premature socket error as Jean-Adrien, so I added the timeout and xcievers settings recently. I haven't seen the premature socket issue anymore with the setting changes, but now I am hitting this one.

I checked the META and verified that all the online regions existed in the table regions but to make sure I will upgrade to 0.18.1 to avoid the HBASE-921 issue.


Should I change the flush size? compaction threshold?

thanks,
Dru


On Oct 23, 2008, at 12:20 PM, stack wrote:

Dru: If compactions are taking 4 minutes, then your instance is being overrun; it's unable to keep up with your rate of upload. What's your upload rate like? How are you doing it? Or is it that your servers are buckled carrying the load? Are they swapping? Usually compaction runs fast. It'll take long if it's compacting many more than the threshold. Grep your logs and see if compactions are taking steadily longer. Do you have a lot of blocking happening in your logs (where the regionserver puts up a temporary block of updates because it isn't able to flush fast enough)? You're on recent hbase? Have you altered flush or maximum region file sizes?
St.Ack

Dru Jensen wrote:
Stack and J-D, Thanks for your responses.

It looks like the RetriesExhaustedException occurred during:
2008-10-23 11:08:55,180 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region ... 1224785065371 in 4mins, 25sec
It doesn't look like I am having the HBASE-921 issue (yet).
What settings can I change to cause the compaction to not take so long?
I found this setting:

<property>
   <name>hbase.hstore.compactionThreshold</name>
   <value>3</value>
   <description>
   If more than this number of HStoreFiles in any one HStore
   (one HStoreFile is written per flush of memcache) then a compaction
   is run to rewrite all HStoreFiles files as one.  Larger numbers
   put off compaction but when it runs, it takes longer to complete.
   During a compaction, updates cannot be flushed to disk.  Long
   compactions require memory sufficient to carry the logging of
   all updates across the duration of the compaction.

   If too large, clients timeout during compaction.
   </description>
</property>

Should I lower this or is there a better way?

Thanks,
Dru

On Oct 23, 2008, at 11:37 AM, Jean-Daniel Cryans wrote:
Dru.

See also if it's a case of HBASE-921
<https://issues.apache.org/jira/browse/HBASE-921> because it
would make sense if not using hbase 0.18.1 and under a heavy
load.

J-D

On Thu, Oct 23, 2008 at 2:30 PM, stack <[EMAIL PROTECTED]> wrote:
Find the MR task that failed. Click through the UI to look at its logs. It may have interesting info. It's probably complaining about a region not being available (NSRE). Figure which region it is. Use the region historian or grep in the master logs -- 'grep -v metaScanner REGIONNAME' so you avoid the metaScanner noise -- to see if you can figure the region's history around the failure. Look too at loading around failure time. Were you swapping, etc. (Ganglia or some such helps here).

You might also test table is still wholesome -- that the MR job didn't damage the table. A quick check that all regions are onlined and accessible is to scan for a column whose column family does exist but whose qualifier you know is not present: e.g. if you have column family 'page' and you know there is no column 'page:xyz', scan with that (Enable DEBUG in log4j so you can see regions being loaded as scan progresses): "scan 'TABLENAME', ['page:xyz']".

You might need to up the timeouts/retries.
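
As a rough illustration of why the retries might run out, the total time a client keeps retrying can be compared with how long a region stays unavailable. This is a back-of-the-envelope sketch only: it assumes a fixed pause between attempts, the 10 second pause is an assumed value rather than a setting taken from this thread, and the function and parameter names are hypothetical. The 10 attempts and the 4mins, 25sec compaction are the figures reported elsewhere in the thread.

```python
def retry_window_seconds(retries, pause_ms):
    """Total time a client keeps retrying, assuming a fixed pause per attempt."""
    return retries * pause_ms / 1000.0


def outlasts(retries, pause_ms, unavailable_seconds):
    """True if the retry window is longer than the region's unavailability."""
    return retry_window_seconds(retries, pause_ms) > unavailable_seconds


COMPACTION_SECONDS = 4 * 60 + 25  # the 4mins, 25sec compaction from the log

# 10 attempts is the count reported in the thread; the 10 s pause is assumed.
print(outlasts(10, 10_000, COMPACTION_SECONDS))  # 100 s window: False
print(outlasts(30, 10_000, COMPACTION_SECONDS))  # 300 s window: True
```

Under those assumptions the client gives up well before a 265 second compaction finishes, so either the retry window has to grow or the compactions have to get shorter.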
St.Ack



Dru Jensen wrote:
Hi hbase-users,
During a fairly large MR process, on the Reduce cycle as it's writing its results to a table, I see org.apache.hadoop.hbase.NotServingRegionException in the region server log several times and then I see a split reporting it
was successful.

Eventually, the Reduce process fails with
org.apache.hadoop.hbase.client.RetriesExhaustedException after 10 failed
attempts.

What can I do to fix it?

Thanks,
Dru
