These client error messages are not particularly descriptive as to the root cause (they are fatal errors, or close to it).

What is going on in your regionservers when these errors happen? Check the master and RS logs.

Also, you definitely do not want 19 zookeeper nodes. Reduce that to 3 or 5 max.
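For example, a 3-node quorum would look something like this in hbase-site.xml (the hostnames are placeholders):

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>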

What is the hardware you are using for these nodes, and what settings do you have for heap/GC?

JG

Zhenyu Zhong wrote:
Stack,

Thank you very much for your comments.
I am running a cluster with 20 nodes. I set up 19 of them as both regionservers
and ZooKeeper quorum members.
The versions I am using are Hadoop 0.20.1 and HBase 0.20.1.
I started with an empty table and tried to load 200 million records into it.
Each record carries a key. Logically, in my MR program, I open an HTable during
setup; in my mapper, I fetch the row from HTable via the key in the record,
make some changes to the columns, and write that row back to HTable through
TableOutputFormat by passing a Put. There are no reduce tasks involved here.
(Though it is unnecessary to fetch rows from an empty table, I intended to do
that.)
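Roughly, the mapper looks like the sketch below (the table name, column family,
and helper methods are made up for illustration; the driver sets
TableOutputFormat as the output format with zero reduce tasks):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only -- table name, column family, and helpers are invented for illustration.
public class UpdateMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the table once per task, as described above.
    table = new HTable(new HBaseConfiguration(), "mytable");
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Each input record carries a row key.
    byte[] row = Bytes.toBytes(extractKey(record.toString()));

    // Fetch the existing row for that key (a no-op against an empty table).
    Result existing = table.get(new Get(row));

    // Change some columns and emit a Put; TableOutputFormat writes it back
    // to the table, and there are no reduce tasks.
    Put put = new Put(row);
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
        Bytes.toBytes(newValueFor(existing, record.toString())));
    context.write(new ImmutableBytesWritable(row), put);
  }

  // Hypothetical application logic: the key is assumed to be the first field.
  private String extractKey(String line) {
    return line.split(",")[0];
  }

  private String newValueFor(Result existing, String line) {
    return line;
  }
}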

Additionally, when I reduced the number of regionservers and the size of the
ZooKeeper quorum, I got different errors:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:929)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:580)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:589)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:562)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:693)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:593)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:556)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:127)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:105)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.getRecordWriter(TableOutputFormat.java:116)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:573)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Many thanks in advance.
zhenyu




On Wed, Oct 28, 2009 at 12:39 PM, stack <[email protected]> wrote:

What's your cluster topology?  How many nodes are involved?  When you see the
below message, how many regions in your table?  How are you loading your
table?
Thanks,
St.Ack

On Wed, Oct 28, 2009 at 7:45 AM, Zhenyu Zhong <[email protected]> wrote:
Nitay,

I appreciate it very much.

As Ryan suggested, I increased the ZooKeeper session timeout to 40 seconds,
with the GC options -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
in place. I set the heap size to 4GB.  I also set vm.swappiness=0.
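(Concretely, those settings amount to roughly the following in hbase-env.sh and
hbase-site.xml, plus the sysctl; the values are just the ones described above:)

# hbase-env.sh
export HBASE_HEAPSIZE=4000
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"

<!-- hbase-site.xml: 40 second ZooKeeper session timeout, in milliseconds -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>40000</value>
</property>

# on each node, as root
sysctl -w vm.swappiness=0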

However, it still ran into problems. Please see the following errors.

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server x.x.x.x:60021 for region YYYY,117.99.7.153,1256396118155, row '1170491458', but failed after 10 attempts.
Exceptions:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /x.x.x.x:60021 after attempts=1

        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1001)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:413)


The input file is about 10GB, around 200 million rows of data.
This load doesn't seem too large. However, these kinds of errors keep popping
up.

Do regionservers need to be deployed to dedicated machines?
Does ZooKeeper need to be deployed to dedicated machines as well?

Best,
zhenyu



On Wed, Oct 28, 2009 at 1:37 AM, nitay <[email protected]> wrote:

Hi Zhenyu,

Sorry for the delay. I started working on this a while back, before I left my
job for another company. Since then I haven't had much time to work on HBase
unfortunately :(. I'll try to dig up what I had and see what shape it's in and
update you.

Cheers,
-n


On Oct 27, 2009, at 3:38 PM, Ryan Rawson wrote:

 Sorry I must have mistyped, I meant to say "40 seconds".  You can
still see multi-second pauses at times, so you need to give yourself a
bigger buffer.

The parallel threads argument should not be necessary, but you do need
the UseConcMarkSweepGC flag as well.

Let us know how it goes!
-ryan


On Tue, Oct 27, 2009 at 3:19 PM, Zhenyu Zhong <[email protected]> wrote:

Ryan,
I appreciate your feedback very much.
I have set zookeeper.session.timeout to seconds, which is way higher than
40ms.
At the same time, -Xms is set to 4GB, which should be sufficient.
I also tried GC options like

 -XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC

I even set the vm.swappiness=0

However, I still came across the problem of a RegionServer shutting itself
down.

Best,
zhong


On Tue, Oct 27, 2009 at 6:05 PM, Ryan Rawson <[email protected]>
wrote:
 Set the ZK timeout to something like 40ms, and give the GC enough Xmx so you
never risk entering the much dreaded concurrent-mode-failure whereby the
entire heap must be GCed.

Consider testing Java 7 and the G1 GC.
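(G1 is still behind an experimental flag on the JVMs shipping today, so
enabling it for HBase would look something like this in hbase-env.sh:

export HBASE_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC"
)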

We could get a JNI thread to do this, but no one has done so yet. I am
personally hoping for G1 and in the meantime overprovision our Xmx to avoid
the concurrent mode failures.

-ryan

On Tue, Oct 27, 2009 at 2:59 PM, Zhenyu Zhong <[email protected]> wrote:

Ryan,

Thank you very much.
May I ask whether there are any ways to get around this problem to make HBase
more stable?

best,
zhong



On Tue, Oct 27, 2009 at 4:06 PM, Ryan Rawson <[email protected]>
wrote:

 There isn't any working code yet, just an idea and a prototype.
There is some sense that if we can get the G1 GC we could get rid of all long
pauses, and avoid the need for this.

-ryan

On Mon, Oct 26, 2009 at 2:30 PM, Zhenyu Zhong <[email protected]> wrote:

Hi,

I am very interested in the solution that Joey proposed and would like to give
it a try.
Does anyone have any ideas on how to deploy this zk_wrapper in JNI
integration?

I would very much appreciate it.

thanks
zhong


