[
https://issues.apache.org/jira/browse/HBASE-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023706#comment-15023706
]
Enis Soztutar commented on HBASE-14368:
---------------------------------------
False alarm. I think it was a network issue in my setup. I have seen this
before:
{code}
<testcase name="testLockupWhenSyncInMiddleOfZigZagSetup" classname="org.apache.hadoop.hbase.regionserver.TestWALLockup" time="30.594">
<error message="test timed out after 30000 milliseconds" type="java.lang.Exception"><![CDATA[java.lang.Exception: test timed out after 30000 milliseconds
  at java.net.PlainDatagramSocketImpl.peekData(Native Method)
  at java.net.DatagramSocket.receive(DatagramSocket.java:767)
  at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:416)
  at com.sun.jndi.dns.DnsClient.query(DnsClient.java:210)
  at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
  at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:430)
  at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:231)
  at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:139)
  at com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
  at sun.security.krb5.KrbServiceLocator.getKerberosService(KrbServiceLocator.java:87)
  at sun.security.krb5.Config.checkRealm(Config.java:1295)
  at sun.security.krb5.Config.getRealmFromDNS(Config.java:1268)
  at sun.security.krb5.Config.getDefaultRealm(Config.java:1162)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
  at org.apache.hadoop.security.authentication.util.KerberosName.<clinit>(KerberosName.java:86)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:247)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:234)
  at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:749)
  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:734)
  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:607)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2748)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2740)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
  at org.apache.hadoop.hbase.regionserver.TestWALLockup.testLockupWhenSyncInMiddleOfZigZagSetup(TestWALLockup.java:194)
{code}
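The trace shows the test thread blocked in a UDP receive while Hadoop's UserGroupInformation resolves the default Kerberos realm over DNS; with an unreachable resolver, that receive just sits there until the JUnit 30s timeout fires. A standalone sketch of that blocking behavior (nothing HBase-specific; the class, the helper name, and the short timeout are illustrative only):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketTimeoutException;

public class BlockingUdpDemo {
    /**
     * Waits on a UDP socket that nobody ever sends to, like
     * DnsClient.doUdpQuery querying a dead DNS server in the trace above.
     * Returns true if the receive blocked until the socket timeout.
     */
    static boolean waitForReplyTimesOut(int timeoutMillis) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            // Without a socket timeout, receive() would block indefinitely
            // and the JUnit test timeout would be the first thing to fire.
            socket.setSoTimeout(timeoutMillis);
            byte[] buf = new byte[512];
            try {
                socket.receive(new DatagramPacket(buf, buf.length));
                return false; // unexpected reply
            } catch (SocketTimeoutException expected) {
                return true; // blocked the full timeout, as in the hung test
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(waitForReplyTimesOut(500)
                ? "timed out waiting for a reply"
                : "got a reply");
    }
}
```

This is why a flaky resolver makes an unrelated WAL test "fail": the DNS lookup happens in a static initializer on the test's own thread, so the whole 30s budget is eaten before the test body gets anywhere.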
> New TestWALLockup broken by addendum added to parent issue
> ----------------------------------------------------------
>
> Key: HBASE-14368
> URL: https://issues.apache.org/jira/browse/HBASE-14368
> Project: HBase
> Issue Type: Sub-task
> Components: test
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0
>
> Attachments: 14368.txt, 14368.txt
>
>
> My second addendum broke TestWALLockup, the one that did this:
> https://issues.apache.org/jira/browse/HBASE-14317?focusedCommentId=14730301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14730301
> {code}
> diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> index 5708c30..c421f5c 100644
> --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> @@ -878,8 +878,19 @@ public class FSHLog implements WAL {
>          // Let the writer thread go regardless, whether error or not.
>          if (zigzagLatch != null) {
>            zigzagLatch.releaseSafePoint();
> -          // It will be null if we failed our wait on safe point above.
> -          if (syncFuture != null) blockOnSync(syncFuture);
> +          // syncFuture will be null if we failed our wait on safe point above. Otherwise, if
> +          // latch was obtained successfully, the sync we threw in either trigger the latch or it
> +          // got stamped with an exception because the WAL was damaged and we could not sync. Now
> +          // the write pipeline has been opened up again by releasing the safe point, process the
> +          // syncFuture we got above. This is probably a noop but it may be stale exception from
> +          // when old WAL was in place. Catch it if so.
> +          if (syncFuture != null) {
> +            try {
> +              blockOnSync(syncFuture);
> +            } catch (IOException ioe) {
> +              if (LOG.isTraceEnabled()) LOG.trace("Stale sync exception", ioe);
> +            }
> +          }
> {code}
> It broke the test because the test hand-feeds appends and syncs, controlling
> when they should throw exceptions. In the test we manufactured the case where
> an append fails, and we then asserted that the following sync would fail.
> The problem was that we expected the failure to be a dropped-snapshot failure,
> because a failed sync is a catastrophic event... but our hand-feeding actually
> reproduced the case where a sync goes into the damaged file before it has
> rolled... which, after the addendum, is no longer a catastrophic event... we
> just catch and move on.
> The attached patch just removes the check for a dropped snapshot and the
> check that abort was called.
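> The catch-and-move-on behavior the addendum introduces can be sketched
> standalone. SyncFuture and blockOnSync below are simplified stand-ins for
> the FSHLog internals, not the real classes:
>
> ```java
> import java.io.IOException;
>
> public class StaleSyncDemo {
>     /** Stand-in for FSHLog's SyncFuture: may be stamped with a stale exception. */
>     static class SyncFuture {
>         private final IOException stale;
>         SyncFuture(IOException stale) { this.stale = stale; }
>         void get() throws IOException {
>             if (stale != null) throw stale; // stamped against the old, damaged WAL
>         }
>     }
>
>     /**
>      * Mirrors the patched handling after releaseSafePoint: a stale sync
>      * exception from the pre-roll WAL is swallowed, not propagated.
>      * Returns true if a syncFuture was actually processed.
>      */
>     static boolean blockOnSyncSwallowingStale(SyncFuture syncFuture) {
>         if (syncFuture == null) return false; // failed wait on safe point
>         try {
>             syncFuture.get();
>         } catch (IOException ioe) {
>             // The real patch logs this at TRACE and moves on.
>             System.out.println("Stale sync exception: " + ioe.getMessage());
>         }
>         return true;
>     }
>
>     public static void main(String[] args) {
>         blockOnSyncSwallowingStale(new SyncFuture(new IOException("old WAL damaged")));
>         blockOnSyncSwallowingStale(null);
>     }
> }
> ```
>
> This is why the test's assertions no longer hold: a sync that fails against
> the damaged file before the roll surfaces here, gets swallowed, and never
> escalates to the dropped-snapshot/abort path the test was checking for.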
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)