[ https://issues.apache.org/jira/browse/HBASE-14368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023706#comment-15023706 ]

Enis Soztutar commented on HBASE-14368:
---------------------------------------

False alarm. I think it was network issues in my setup. I have seen this before:
{code}
  <testcase name="testLockupWhenSyncInMiddleOfZigZagSetup" classname="org.apache.hadoop.hbase.regionserver.TestWALLockup" time="30.594">
    <error message="test timed out after 30000 milliseconds" type="java.lang.Exception"><![CDATA[java.lang.Exception: test timed out after 30000 milliseconds
  at java.net.PlainDatagramSocketImpl.peekData(Native Method)
  at java.net.DatagramSocket.receive(DatagramSocket.java:767)
  at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:416)
  at com.sun.jndi.dns.DnsClient.query(DnsClient.java:210)
  at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
  at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:430)
  at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:231)
  at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:139)
  at com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
  at sun.security.krb5.KrbServiceLocator.getKerberosService(KrbServiceLocator.java:87)
  at sun.security.krb5.Config.checkRealm(Config.java:1295)
  at sun.security.krb5.Config.getRealmFromDNS(Config.java:1268)
  at sun.security.krb5.Config.getDefaultRealm(Config.java:1162)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:84)
  at org.apache.hadoop.security.authentication.util.KerberosName.<clinit>(KerberosName.java:86)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:247)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:234)
  at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:749)
  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:734)
  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:607)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2748)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2740)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2606)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
  at org.apache.hadoop.hbase.regionserver.TestWALLockup.testLockupWhenSyncInMiddleOfZigZagSetup(TestWALLockup.java:194)
{code}
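
For reference, the trace shows the 30s budget being eaten by Kerberos default-realm discovery over DNS (sun.security.krb5.Config.getRealmFromDNS, reached through KerberosName's static init), which UserGroupInformation triggers the first time the test touches a FileSystem. A minimal sketch of one way to keep that lookup out of a unit test, assuming it is acceptable to pin a dummy realm/KDC through the standard java.security.krb5.* system properties (this is not part of any patch on this issue):
{code}
// Hypothetical test bootstrap, not from the HBASE-14368 patch: pin the Kerberos realm
// and KDC so sun.security.krb5.Config never falls back to the DNS SRV lookup seen in
// the stack trace above. Must run before UserGroupInformation is first initialized,
// e.g. from a @BeforeClass method.
public final class PinKrb5ForTests {
  public static void pinKrb5Properties() {
    // The JDK only honors these overrides when both are set; dummy values suffice
    // because nothing in the test actually authenticates against a KDC.
    System.setProperty("java.security.krb5.realm", "EXAMPLE.COM");
    System.setProperty("java.security.krb5.kdc", "localhost");
  }

  private PinKrb5ForTests() {
  }
}
{code}
With the realm pinned, a slow or unreachable resolver can no longer push UserGroupInformation.getCurrentUser() past the test timeout.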

> New TestWALLockup broken by addendum added to parent issue
> ----------------------------------------------------------
>
>                 Key: HBASE-14368
>                 URL: https://issues.apache.org/jira/browse/HBASE-14368
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0
>
>         Attachments: 14368.txt, 14368.txt
>
>
> My second addendum broke TestWALLockup, the one that did this: 
> https://issues.apache.org/jira/browse/HBASE-14317?focusedCommentId=14730301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14730301
> {code}
> diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> index 5708c30..c421f5c 100644
> --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> @@ -878,8 +878,19 @@ public class FSHLog implements WAL {
>          // Let the writer thread go regardless, whether error or not.
>          if (zigzagLatch != null) {
>            zigzagLatch.releaseSafePoint();
> -          // It will be null if we failed our wait on safe point above.
> -          if (syncFuture != null) blockOnSync(syncFuture);
> +          // syncFuture will be null if we failed our wait on safe point above. Otherwise, if
> +          // latch was obtained successfully, the sync we threw in either trigger the latch or it
> +          // got stamped with an exception because the WAL was damaged and we could not sync. Now
> +          // the write pipeline has been opened up again by releasing the safe point, process the
> +          // syncFuture we got above. This is probably a noop but it may be stale exception from
> +          // when old WAL was in place. Catch it if so.
> +          if (syncFuture != null) {
> +            try {
> +              blockOnSync(syncFuture);
> +            } catch (IOException ioe) {
> +              if (LOG.isTraceEnabled()) LOG.trace("Stale sync exception", ioe);
> +            }
> +          }
> {code}
> It broke the test because the test hand-feeds appends and syncs and dictates when
> they should throw exceptions. In the test we manufactured the case where an append
> fails and then asserted that the following sync would fail.
> Problem was that we expected the failure to be a dropped-snapshot failure, because a
> failed sync is a catastrophic event... but our hand-feeding actually reproduced the
> case where a sync goes into the damaged file before it has rolled... which is no
> longer a catastrophic event... we just catch and move on.
> The attached patch just removes the check for a dropped snapshot and the assertion
> that abort was called.
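> For illustration, a hypothetical sketch (not the attached 14368.txt; TestRegion/TestServer
> are stand-ins for the test's own fixtures) of the expectation the patch drops versus what
> remains valid after the FSHLog change quoted above:
> {code}
> import org.apache.hadoop.hbase.DroppedSnapshotException;
>
> // Hypothetical sketch only; fixture names are invented for this illustration.
> public class LockupExpectationSketch {
>   interface TestServer { boolean isAborted(); }
>   interface TestRegion { void flush(boolean force) throws java.io.IOException; }
>
>   // Before the addendum: a failed sync was catastrophic, so the flush was expected to
>   // drop its snapshot and the server was expected to abort.
>   static void oldExpectation(TestRegion region, TestServer server) throws java.io.IOException {
>     try {
>       region.flush(true);
>       throw new AssertionError("Expected DroppedSnapshotException");
>     } catch (DroppedSnapshotException dse) {
>       assert server.isAborted();
>     }
>   }
>
>   // After the addendum: the sync thrown into the damaged, not-yet-rolled WAL is caught
>   // and traced inside FSHLog, so neither the dropped snapshot nor the abort can be
>   // asserted; the flush may simply complete.
>   static void newExpectation(TestRegion region) throws java.io.IOException {
>     region.flush(true);
>   }
> }
> {code}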


