Hey Stephen:

On 1., the OOME was in the client?  When I see the 'all datanodes are bad' message,
it usually means HDFS has gone away.

In 2., you see 'No node available for block'.  This and the above would seem
to indicate you are suffering from the lack of the fix in
https://issues.apache.org/jira/browse/HDFS-127.

If you can shut down hbase, then 3. is for sure the way to go -- it's
complete and runs quickest.  I'm surprised though that it would complain of
missing blocks when fsck does not.

Can we get you to migrate to 0.20.0?
St.Ack



On Tue, Aug 18, 2009 at 3:25 AM, stephen mulcahy
<[email protected]> wrote:

> Hi,
>
> I'm a relative newcomer to both HBase and Hadoop so please bear with me if
> some of my queries don't make sense.
>
> I'm managing a small HBase cluster (1 dedicated master, 4 regionservers)
> and am currently attempting to take a backup of the data (we can regenerate
> the data in our HBase but it will take time). I've tried a number of
> different approaches (details below) - I'm wondering if I've missed an
> approach or whether the approach I'm using is the best. All comments
> welcome.
>
> I'm using HBase 0.19.3 running on top of Hadoop 0.19.1 and our HBase
> contains a single table with about 50 million rows.
>
> 1. Initially, I came across http://issues.apache.org/jira/browse/HBASE-897
> which seemed like the ideal way for us to back up our HBase installation
> while allowing it to continue running. I ran into a number of problems with
> this, which I suspect are due to my HBase cluster being underpowered (I
> first ran into OutOfMemory exceptions; after bumping the JVM max heap size
> on the client to 512MB, I then saw some java.lang.NullPointerExceptions
> during the map phase - I'm not sure if these are due to resource issues on
> the HBase cluster or some underlying corruption in HBase).
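>
> (The heap bump was something like
>
> export HADOOP_HEAPSIZE=512
>
> in hadoop-env.sh on the machine submitting the job, or the equivalent
> HBASE_HEAPSIZE in hbase-env.sh.)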
>
> After adding the following to HBase
>
> export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"
>
> and setting
>
> <property>
>    <name>mapred.map.tasks</name>
>    <value>13</value>
>  </property>
>
>  <property>
>    <name>mapred.reduce.tasks</name>
>    <value>5</value>
>  </property>
>
> in the Hadoop config on the system submitting the backup job, it seemed to
> progress further, but ultimately died with various failures including the
> following,
>
> java.io.IOException: All datanodes 192.168.1.2:50010 are bad. Aborting...
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2444)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:1996)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
>
>
> which again suggests to me that maybe our cluster isn't beefy enough to run
> HBase and the M/R job required to do the backup.
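>
> (As an aside, if the backup job goes through ToolRunner, I believe the same
> settings could also be passed per-run on the command line instead of editing
> the site config, along the lines of
>
> hadoop jar <backup-job.jar> <MainClass> -D mapred.map.tasks=13 -D mapred.reduce.tasks=5 ...
>
> where the jar and class names are just placeholders.)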
>
> 2. Given the lack of success with the M/R backup, I figured I'd shut down
> HBase and try a copyToLocal of the entire /hbase tree.
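>
> (i.e. something like
>
> hadoop fs -copyToLocal /hbase /backup/hbase
>
> with /backup/hbase just an example local destination.)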
>
> This failed after a few minutes with the following error,
>
> 09/08/17 17:53:07 INFO hdfs.DFSClient: No node available for block:
> blk_7870832778982080356_55873
> file=/hbase/log_192.168.1.3_1240589781392_60020/hlog.dat.1241539463091
> 09/08/17 17:53:07 INFO hdfs.DFSClient: Could not obtain block
> blk_7870832778982080356_55873 from any node: java.io.IOException: No
> live nodes contain current block
>
> (and a bunch of other errors - all the same). This suggests to me that
> there is some issue with our HBase and that some corruption has occurred.
> Looking in JIRA, there seem to be a few instances where this can occur in
> 0.19.3 / 0.19.1. I tried running HDFS fsck - but it reports the entire
> filesystem as healthy. Is there anything I can run to force HBase to verify
> its integrity and drop any rows affected by the above problem?
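>
> (The fsck invocation was along the lines of
>
> hadoop fsck /hbase -files -blocks -locations
>
> and it reported everything under /hbase as healthy.)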
>
> 3. Having failed with the copyToLocal, plan C was to try a -distcp to
> another cluster. Initial efforts with -distcp failed with errors about bad
> blocks again. I tried running -distcp with the -i option (to ignore errors)
> and the copy completed. I've configured HBase on the copy destination to
> use the copied hbase tree and it seems to start OK. I'm currently running a
> count against the copied hbase table to see how different it is from the
> original. Does it seem likely that my copy is corrupt or will HBase handle
> the missing blocks gracefully? How do other people verify the integrity of
> their HBase? Are there tools like fsck which can be run at the HBase level?
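>
> (For reference, the distcp was along the lines of
>
> hadoop distcp -i hdfs://source-namenode:9000/hbase hdfs://backup-namenode:9000/hbase
>
> with the namenode addresses as placeholders, and the count is a plain hbase
> shell count, e.g.
>
> count 'mytable'
>
> where 'mytable' stands in for our actual table name.)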
>
> Any comments on my approach to backups are welcome - as I say, I'm far from
> the top of this particular learning curve!
>
> thanks,
>
> -stephen
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie    http://webstar.deri.ie    http://sindice.com
>
