[
https://issues.apache.org/jira/browse/HBASE-21666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778462#comment-16778462
]
Tak Lon (Stephen) Wu edited comment on HBASE-21666 at 2/26/19 6:54 PM:
-----------------------------------------------------------------------
I have done the investigation below and found that the hanging/slowness is related to the
test node's network setup and to local disk issues. I'd like to propose failing
fast instead of timing out at 780+ seconds where possible.
First of all, test methods in {{TestExportSnapshot}} contain two phases of
operations: operations in the Mini HBase Cluster and operations in the Mini MR
Cluster, and we only snapshot 50 rows into a test table (the data is very small).
So, the timeout issue is related to the following:
1. the building node has an incorrect network interface setup, such that
a. HDFS file operations hang, e.g.
{quote}2019-02-25 22:28:36,099 ERROR [ClientFinalizer-shutdown-hook]
hdfs.DFSClient(949): Failed to close inode 16420
java.io.EOFException: End of File Exception between local host is:
"f45c89a57f29.ant.amazon.com/192.168.1.15"; destination host is:
"localhost":54524; : java.io.EOFException; For more details see:
[http://wiki.apache.org/hadoop/EOFException]
{quote}
b. a server (region server or HMaster) cannot be connected to, or regions cannot
be assigned and keep retrying until timeout, e.g.
{quote}2019-02-26 09:27:54,754 DEBUG
[RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=57922]
client.RpcRetryingCallerImpl(132): Call exception, tries=10, retries=19,
started=96205 ms ago, cancelled=false, msg=Call to
f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception:
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed
servers list: f45c89a57f29-2.local/10.63.166.57:57926, details=row
'testtb-testExportFileSystemStateWithSkipTmp' on table 'hbase:meta' at
region=hbase:meta,,1.1588230740,
hostname=f45c89a57f29-2.local,57926,1551201763075, seqNum=-1, see
[https://s.apache.org/timeout],
exception=org.apache.hadoop.hbase.ipc.FailedServerException: Call to
f45c89a57f29-2.local/10.63.166.57:57926 failed on local exception:
org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed
servers list: f45c89a57f29-2.local/10.63.166.57:57926`
{quote}
2. the building node has run out of disk space, such that the node manager is
not in a healthy state; e.g. I saw in the node manager UI {{1/1 local-dirs are
bad: /yarn/nm; 1/1 log-dirs are bad: /yarn/container-logs}} even though we have set
{{yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage}}
to 99%.
Of the above cases, case 1) is a node setup issue (e.g. in {{/etc/hosts}}) that
can be fixed by the infra admin or by the contributor running the unit test on
their own laptop/machine, so we don't need to fix it in code.
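For reference, the {{EOFException}} above shows the client resolving its own hostname ({{f45c89a57f29.ant.amazon.com}}) to a LAN address while the mini cluster binds to {{localhost}}; a common fix for such a mismatch is mapping the machine's hostname to the loopback address in {{/etc/hosts}} (the hostname here is taken from the log and is only illustrative):

```
127.0.0.1   localhost f45c89a57f29.ant.amazon.com
```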
For case 2), I'm thinking of setting
{{yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb}} to a new
value of 128 MB (which should be enough for log-dirs and local-dirs), so that the
mini MR cluster started by
{{[TestExportSnapshot#setUpBeforeClass|https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/snapshot/TestExportSnapshot.java#L100-L104]}}
fails fast instead of timing out after 780+ seconds.
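For background, the NodeManager's disk health checker marks a local-dir or log-dir as bad once its usable space drops below the configured minimum. A minimal standalone sketch of that free-space check (not the actual YARN implementation; {{DiskHealthSketch}} and {{isDirHealthy}} are hypothetical names for illustration):

```java
import java.io.File;

public class DiskHealthSketch {
  // Sketch of YARN's min-free-space check: a directory is considered
  // "bad" when its usable space falls below the configured threshold.
  static boolean isDirHealthy(File dir, long minFreeSpaceMb) {
    long usableMb = dir.getUsableSpace() / (1024L * 1024L);
    return usableMb >= minFreeSpaceMb;
  }

  public static void main(String[] args) {
    File tmp = new File(System.getProperty("java.io.tmpdir"));
    // With the proposed 128 MB threshold, a nearly full disk is flagged
    // immediately instead of the test timing out much later.
    System.out.println("tmpdir healthy at 128 MB threshold: "
        + isDirHealthy(tmp, 128));
  }
}
```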
In fact, if the building node has none of these connection or disk issues, the
average time to run all (7) tests within {{TestExportSnapshot}} is about 280
seconds, and IMO splitting some of the test methods into separate classes won't
speed it up as long as the tests of each class are executed sequentially. (Are
we running tests in parallel, especially for {{TestExportSnapshot}}, which is
labeled {{LargeTests}}? When I tested with
{{mvn test -PrunAllTests -Dtest=TestExportSnapshot}}, I didn't see methods
running concurrently even though I found {{surefire.secondPartForkCount=5}} for
{{runAllTests}}; but if anyone can confirm they do run in parallel, we could
also separate each method in {{TestExportSnapshot}} into its own class.)
So, if we agree that a disk space issue on YARN's node manager should fail the
tests fast, the proposed code change in
{{HBaseTestingUtility#startMiniMapReduceCluster}} would be as below.
Any comments?
{code:java}
@@ -2736,6 +2736,8 @@ public class HBaseTestingUtility extends HBaseZKTestingUtility {
     conf.setIfUnset(
         "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage",
         "99.0");
+    // Make sure we have enough disk space for log-dirs and local-dirs
+    conf.setIfUnset(
+        "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb", "128");
     startMiniMapReduceCluster(2);
     return mrCluster;
   }
{code}
> Break up the TestExportSnapshot UTs; they can timeout
> -----------------------------------------------------
>
> Key: HBASE-21666
> URL: https://issues.apache.org/jira/browse/HBASE-21666
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: stack
> Assignee: Tak Lon (Stephen) Wu
> Priority: Major
> Labels: beginner
>
> These timed out for [~Apache9] when he ran with the -PrunAllTests. Suggests
> breaking them up into smaller tests so less likely they'll timeout.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)