These client error messages are not particularly descriptive as to the root cause (they are fatal errors, or close to it).

What is going on in your regionservers when these errors happen? Check the master and RS logs.

Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 or 5 max.
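For example, a 3-node quorum would look something like this in hbase-site.xml (the hostnames are placeholders):

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>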

What is the hardware you are using for these nodes, and what settings do you have for heap/GC?

JG

Zhenyu Zhong wrote:
Stack,

Thank you very much for your comments.
I am running a cluster with 20 nodes. I set up 19 of them as both regionservers
and ZooKeeper quorum members.
The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
I started with an empty table and tried to load 200 million records into it.
Each record carries a key. Logically, in my MR program, I open an HTable during
setup; in my mapper, I fetch the row from HTable via the key in the record,
make some changes to the columns, and write that row back to HTable through
TableOutputFormat by passing a Put. There are no reduce tasks involved here.
(Though it is unnecessary to fetch rows from an empty table, I intended to do
that.)
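Roughly, the mapper looks like the sketch below (the table name, column family,
and helper methods are made up for illustration; the driver sets
TableOutputFormat as the output format with zero reduce tasks):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only -- table name, column family, and helpers are invented for illustration.
public class UpdateMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the table once per task, as described above.
    table = new HTable(new HBaseConfiguration(), "mytable");
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Each input record carries a row key.
    byte[] row = Bytes.toBytes(extractKey(record.toString()));

    // Fetch the existing row for that key (a no-op against an empty table).
    Result existing = table.get(new Get(row));

    // Change some columns and emit a Put; TableOutputFormat writes it back
    // to the table, and there are no reduce tasks.
    Put put = new Put(row);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
        Bytes.toBytes(newValueFor(existing, record.toString())));
    context.write(new ImmutableBytesWritable(row), put);
  }

  // Hypothetical application logic: the key is assumed to be the first field.
  private String extractKey(String line) {
    return line.split(",")[0];
  }

  private String newValueFor(Result existing, String line) {
    return line;
  }
}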

Additionally, when I reduced the number of regionservers and the size of the
ZooKeeper quorum, I got different errors:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Many thanks in advance.
zhenyu




On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected]> wrote:

What's your cluster topology?  How many nodes are involved?  When you see the
below message, how many regions in your table?  How are you loading your
table?
Thanks,
St.Ack

On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <[email protected]> wrote:
Nitay,

I appreciate it very much.

As Ryan suggested, I increased the ZooKeeper session timeout to 40 seconds,
with the GC options -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
in place. I set the heap size to 4GB.  I also set vm.swappiness=0.
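(Concretely, those settings amount to roughly the following in hbase-env.sh and
hbase-site.xml, plus the sysctl; the values are just the ones described above:)

# hbase-env.sh
export HBASE_HEAPSIZE=4000
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"

<!-- hbase-site.xml: 40 second ZooKeeper session timeout, in milliseconds -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>40000</value>
</property>

# on each node, as root
sysctl -w vm.swappiness=0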

However, it still ran into problems. Please see the following errors.

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server x.x.x.x:60021 for region YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 attempts.
Exceptions:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1

        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)


The input file is about 10GB, around 200 million rows of data.
This load doesn't seem too large. However, these kinds of errors keep popping
up.

Do regionservers need to be deployed to dedicated machines?
Does ZooKeeper need to be deployed to dedicated machines as well?

Best,
zhenyu



On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]> wrote:

Hi Zhenyu,

Sorry for the delay. I started working on this a while back, before I left my
job for another company. Since then I haven't had much time to work on HBase
unfortunately :(. I'll try to dig up what I had and see what shape it's in and
update you.

Cheers,
-n


On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:

 Sorry I must have mistyped, I meant to say "40 seconds".  You can
still see multi-second pauses at times, so you need to give yourself a
bigger buffer.

The parallel threads argument should not be necessary, but you do need
the UseConcMarkSweepGC flag as well.

Let us know how it goes!
-ryan


On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <[email protected]> wrote:

Ryan,
I appreciate your feedback very much.
I have set zookeeper.session.timeout to seconds, which is way higher than
40ms.
At the same time, -Xms is set to 4GB, which should be sufficient.
I also tried GC options like

 -XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC

I even set the vm.swappiness=0

However, I still came across the problem of a RegionServer shutting itself
down.

Best,
zhong


On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <[email protected]>
wrote:
 Set the ZK timeout to something like 40ms, and give the GC enough Xmx so you
never risk entering the much dreaded concurrent-mode-failure whereby the
entire heap must be GCed.

Consider testing Java 7 and the G1 GC.
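(G1 is still behind an experimental flag on the JVMs shipping today, so
enabling it for HBase would look something like this in hbase-env.sh:

export HBASE_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC"
)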

We could get a JNI thread to do this, but no one has done so yet. I am
personally hoping for G1 and in the meantime overprovision our Xmx to avoid
the concurrent mode failures.

-ryan

On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <[email protected]> wrote:

Ryan,

Thank you very much.
May I ask whether there are any ways to get around this problem to make HBase
more stable?

best,
zhong



On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <[email protected]>
wrote:

 There isn't any working code yet, just an idea and a prototype.
There is some sense that if we can get the G1 GC we could get rid of all long
pauses, and avoid the need for this.

-ryan

On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <[email protected]> wrote:

Hi,

I am very interested in the solution that Joey proposed and would like to give
it a try.
Does anyone have any ideas on how to deploy this zk_wrapper in JNI
integration?

I would very much appreciate it.

thanks
zhong


