[
https://issues.apache.org/jira/browse/HADOOP-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357332#comment-14357332
]
Chris Nauroth commented on HADOOP-11693:
----------------------------------------
Thanks for the explanation of the normal case vs. the new backoff policy. That
makes sense.
The eclipse:eclipse failure looks unrelated. I couldn't reproduce it locally.
Sorry to nitpick, but there are still some lines in
{{AzureNativeFileSystemStore}} that exceed the 80-character limit. I know some
existing lines in this file already break the rule. Don't worry about cleaning
up all of the existing code, but please do make sure every line touched in the
patch stays within the 80-character limit.
The findbugs warning is legitimate. I'm not sure why it's only triggering now
with this patch, since the problem appears to predate it. We can fix it by
replacing the single {{catch (Exception e)}} with two separate catch clauses,
{{catch (StorageException e)}} and {{catch (URISyntaxException e)}}, each
rethrown wrapped in an {{AzureException}}.
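For example, a minimal sketch of that shape (the blob variables and the
surrounding rename logic are illustrative, not the exact code in
{{AzureNativeFileSystemStore}}):
{code}
try {
  // Copy-then-delete rename; the wrapper's startCopyFromBlob declares both
  // checked exceptions, so each gets its own clause instead of a blanket
  // catch (Exception e).
  dstBlob.startCopyFromBlob(srcBlob);
} catch (StorageException e) {
  // Rethrow wrapped, preserving the original cause.
  throw new AzureException(e);
} catch (URISyntaxException e) {
  throw new AzureException(e);
}
{code}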
We're almost there. Thanks, Duo!
> Azure Storage FileSystem rename operations are throttled too aggressively to complete HBase WAL archiving.
> ----------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11693
> URL: https://issues.apache.org/jira/browse/HADOOP-11693
> Project: Hadoop Common
> Issue Type: Bug
> Components: tools
> Reporter: Duo Xu
> Assignee: Duo Xu
> Attachments: HADOOP-11681.01.patch, HADOOP-11681.02.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by
> Azure Storage while HBase was archiving old WALs. The HMaster aborted the
> region server and tried to restart it.
> However, since the cluster was still being throttled by Azure Storage, the
> subsequent distributed log splitting also failed. Sometimes the hbase:meta
> table was on that region server and eventually showed as offline, which left
> the whole cluster in a bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044 reported a fatal error:
> ABORTING region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
>     at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
>     at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>     at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>     at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>     at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>     at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>     ... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901, will retry
>     at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
>     at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
>     ... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>     at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>     at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>     at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>     at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>     ... 11 more
> Sun Mar 01 18:59:51 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB performs a rename by copying the source blob to
> the destination blob and then deleting the source blob. Copy blob is a very
> costly operation in Azure Storage, and during Azure Storage GC it is highly
> likely to be throttled. The throttling usually ends within 15 minutes. The
> current WASB retry policy is exponential backoff, but it lasts at most about
> 2 minutes in total. A short-term fix is to add a more persistent exponential
> retry when a copy blob operation is throttled; a sketch follows.
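> A minimal sketch of that short-term fix, assuming the SDK's
> {{RetryExponentialRetry}} and the {{BlobRequestOptions}} overload of
> {{startCopyFromBlob}}; the backoff values and blob variables here are
> illustrative, not what the patch will hard-code:
> {code}
> import com.microsoft.windowsazure.storage.RetryExponentialRetry;
> import com.microsoft.windowsazure.storage.blob.BlobRequestOptions;
>
> // Illustrative values: a longer, copy-blob-specific exponential retry sized
> // to ride out the ~15 minute throttling window described above.
> BlobRequestOptions copyOptions = new BlobRequestOptions();
> copyOptions.setRetryPolicyFactory(new RetryExponentialRetry(
>     3 * 1000,   // minBackoff: 3 s
>     90 * 1000,  // deltaBackoff: 90 s
>     90 * 1000,  // maxBackoff: 90 s
>     15));       // maxAttempts
> // Pass the options on the copy request rather than the client-wide default:
> dstBlob.startCopyFromBlob(srcBlob.getUri(),
>     null /* sourceAccessCondition */, null /* destinationAccessCondition */,
>     copyOptions, null /* opContext */);
> {code}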