[
https://issues.apache.org/jira/browse/HDFS-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224146#comment-16224146
]
Allen Wittenauer commented on HDFS-12711:
-----------------------------------------
I've got a hypothesis.
Somewhere in the HDFS code base is a try/catch block that is effectively
ignoring system exceptions. A test times out and surefire sends (probably) a
SIGINT. The try/catch grabs the exception and tosses it to the side, all the
while eating CPU and IO. This situation makes more tests time out. Surefire
sends more SIGINTs, which also either get ignored or never "make it" to the
process because CPU is scarce. Surefire, thinking those signals were received,
fires off even more tests ...
This pattern continues until eventually there is nothing left for surefire
and/or maven to do but die on its own, leaving lots of unreaped children
doing nothing but destroying the box.
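To make the hypothesis concrete, here's a minimal sketch (hypothetical, not
actual HDFS code) of the kind of try/catch block I mean: a worker loop that
catches InterruptedException and tosses it aside, so the thread keeps running
and burning CPU even after it has been asked to stop. The class and method
names are made up for illustration.

```java
public class SwallowedInterrupt {
    // Starts a worker thread whose loop discards InterruptedException.
    public static Thread startWorker() {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    // Simulated work; sleep stands in for real IO/CPU activity.
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    // BUG: exception is swallowed and the interrupt status is
                    // cleared, so the loop spins on as if nothing happened.
                    // Correct code would restore the flag
                    // (Thread.currentThread().interrupt()) and/or exit the loop.
                }
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        Thread worker = startWorker();
        worker.interrupt();   // ask the thread to stop
        Thread.sleep(200);
        // The interrupt was eaten, so the worker is still alive.
        System.out.println("worker alive after interrupt: " + worker.isAlive());
    }
}
```

Scale that up to a forked test JVM full of such loops and you get exactly the
unreaped, CPU-eating children described above.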
One thing has been bothering me. Why are projects like HBase that are using
openjdk7 + some form of branch-2 code base not seeing these problems?
What if the code path were a less frequently traveled one, a feature that isn't
heavily used? For the vast majority of committers testing a release, it's
probably not even tested "for reals", never mind in a hostile environment where
CPU, IO, whatever is scarce. But the HDFS unit tests (and maybe the MR unit
tests) would almost certainly hit that path, probably several times over.
> deadly hdfs test
> ----------------
>
> Key: HDFS-12711
> URL: https://issues.apache.org/jira/browse/HDFS-12711
> Project: Hadoop HDFS
> Issue Type: Test
> Affects Versions: 2.9.0, 2.8.2
> Reporter: Allen Wittenauer
> Priority: Critical
> Attachments: HDFS-12711.branch-2.00.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)