[
https://issues.apache.org/jira/browse/HADOOP-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell moved HBASE-13173 to HADOOP-11693:
-------------------------------------------------
Key: HADOOP-11693 (was: HBASE-13173)
Project: Hadoop Common (was: HBase)
> Azure Storage FileSystem rename operations are throttled too aggressively to
> complete HBase WAL archiving.
> ----------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11693
> URL: https://issues.apache.org/jira/browse/HADOOP-11693
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Duo Xu
> Attachments: HADOOP-11681.01.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by
> Azure storage while HBase was archiving old WALs. HMaster aborted the region
> server and tried to restart it.
> However, since the cluster was still being throttled by Azure storage, the
> subsequent distributed log splitting also failed. Sometimes the hbase:meta
> table was on that region server and eventually showed as offline, which left
> the whole cluster in a bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region
> server
> workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044
> reported a fatal error:
> ABORTING region server
> workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE
> in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException:
> com.microsoft.windowsazure.storage.StorageException: The server is busy.
> at
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
> at
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
> at
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
> at
> org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
> at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is
> busy.
> at
> com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> at
> com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> at
> com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> at
> com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
> at
> org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
> at
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
> ... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler:
> Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for
> workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901,
> will retry
> at
> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException:
> com.microsoft.windowsazure.storage.StorageException: The server is busy.
> at
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
> at
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
> at
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
> at
> org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
> at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
> at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
> at
> org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
> at
> org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
> ... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is
> busy.
> at
> com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> at
> com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> at
> com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> at
> com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
> at
> org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
> at
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
> ... 11 more
> Sun Mar 01 18:59:51 GMT 2015,
> org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is
> not online on
> workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB performs a rename by copying the source blob to
> the destination blob and then deleting the source blob. Copying a blob is very
> costly in Azure storage, and during Azure storage GC it is highly likely to be
> throttled. The throttling by Azure storage usually ends within 15 minutes. The
> current WASB retry policy is exponential retry, but it lasts at most 2 minutes
> in total. A short-term fix is to add a more intensive exponential retry when a
> copy-blob operation is throttled.
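A minimal sketch of what "more intensive exponential retry" could look like. All names and parameters here (class name, 3s base delay, 90s cap, 15 attempts) are illustrative assumptions, not the actual patch; the real change would live in the WASB store's retry handling for the copy-blob path. The point is that the cumulative backoff schedule must outlast a ~15-minute throttling episode, unlike the current ~2-minute budget.

```java
import java.util.concurrent.Callable;

/**
 * Illustrative backoff schedule for throttled copy-blob calls
 * (hypothetical names/values, not the actual WASB implementation).
 */
public class ThrottledCopyRetry {
    static final long BASE_DELAY_MS = 3_000;   // first retry after 3s
    static final long MAX_DELAY_MS = 90_000;   // cap each sleep at 90s
    static final int MAX_ATTEMPTS = 15;

    /** Delay before the given (1-based) retry attempt, doubling up to the cap. */
    static long delayForAttempt(int attempt) {
        long d = BASE_DELAY_MS << (attempt - 1);
        return Math.min(d, MAX_DELAY_MS);
    }

    /** Worst-case total wait across all attempts. */
    static long totalBackoffMs() {
        long total = 0;
        for (int i = 1; i <= MAX_ATTEMPTS; i++) {
            total += delayForAttempt(i);
        }
        return total;
    }

    static <T> T callWithRetry(Callable<T> op) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // The real store would test specifically for the
                // "The server is busy." StorageException, not any Exception.
                if (attempt >= MAX_ATTEMPTS) {
                    throw e;
                }
                Thread.sleep(delayForAttempt(attempt));
            }
        }
    }

    public static void main(String[] args) {
        // 3+6+12+24+48s, then ten 90s sleeps: 993s total, > 15 minutes.
        System.out.println("worst-case backoff: "
                + totalBackoffMs() / 60_000 + " minutes");
    }
}
```

With this schedule the worst-case total backoff is 993 seconds (about 16.5 minutes), so a rename that hits a full throttling window can still succeed instead of aborting the log roller.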
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)