[
https://issues.apache.org/jira/browse/HADOOP-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Xu updated HADOOP-11693:
----------------------------
Attachment: HADOOP-11681.02.patch
[~cnauroth]
There is a default retry policy for all Azure storage calls; it is initialized
along with the Azure storage client.
{code}
private static final int DEFAULT_MIN_BACKOFF_INTERVAL = 1 * 1000; // 1s
private static final int DEFAULT_MAX_BACKOFF_INTERVAL = 30 * 1000; // 30s
private static final int DEFAULT_BACKOFF_INTERVAL = 1 * 1000; // 1s
private static final int DEFAULT_MAX_RETRY_ATTEMPTS = 15;
{code}
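For context, here is a minimal sketch of how defaults like these are typically wired
into the storage client's retry policy, assuming the RetryExponentialRetry class and
the setRetryPolicyFactory setter from the com.microsoft.windowsazure.storage SDK
(an illustration only, not the exact WASB code):
{code}
// Sketch only: attach an exponential retry policy built from the default
// constants above to the blob client, so every Azure storage call made
// through that client inherits it.
import com.microsoft.windowsazure.storage.RetryExponentialRetry;
import com.microsoft.windowsazure.storage.blob.CloudBlobClient;

class DefaultRetryConfig {
  static void apply(CloudBlobClient blobClient) {
    RetryExponentialRetry defaultRetry = new RetryExponentialRetry(
        1 * 1000,   // DEFAULT_MIN_BACKOFF_INTERVAL (1s minimum backoff)
        1 * 1000,   // DEFAULT_BACKOFF_INTERVAL (1s delta between attempts)
        30 * 1000,  // DEFAULT_MAX_BACKOFF_INTERVAL (30s cap per attempt)
        15);        // DEFAULT_MAX_RETRY_ATTEMPTS
    blobClient.setRetryPolicyFactory(defaultRetry);
  }
}
{code}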
The backoff interval is 1s. The Azure storage throttling caused by Azure
storage GC happens very rarely. So far only one customer has hit this issue,
and it only lasts 10-15 minutes each day.
Below is the retry policy for copy blob:
{code}
private static final int DEFAULT_COPYBLOB_MIN_BACKOFF_INTERVAL = 3 * 1000;  // 3s
private static final int DEFAULT_COPYBLOB_MAX_BACKOFF_INTERVAL = 90 * 1000; // 90s
private static final int DEFAULT_COPYBLOB_BACKOFF_INTERVAL = 30 * 1000;     // 30s
private static final int DEFAULT_COPYBLOB_MAX_RETRY_ATTEMPTS = 15;
{code}
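And a hedged sketch of how a separate, more patient policy could be scoped to just
the copy blob call, assuming the startCopyFromBlob overload that accepts
BlobRequestOptions (an illustration of the idea, not the attached patch):
{code}
// Sketch only: per-operation BlobRequestOptions override the client-level
// retry policy, so the longer backoff is paid only on the copy blob path
// that gets throttled during Azure storage GC.
import com.microsoft.windowsazure.storage.OperationContext;
import com.microsoft.windowsazure.storage.RetryExponentialRetry;
import com.microsoft.windowsazure.storage.blob.BlobRequestOptions;
import com.microsoft.windowsazure.storage.blob.CloudBlockBlob;

class CopyBlobWithLongRetry {
  static void startCopy(CloudBlockBlob srcBlob, CloudBlockBlob dstBlob)
      throws Exception {
    BlobRequestOptions copyOptions = new BlobRequestOptions();
    copyOptions.setRetryPolicyFactory(new RetryExponentialRetry(
        3 * 1000,    // DEFAULT_COPYBLOB_MIN_BACKOFF_INTERVAL (3s)
        30 * 1000,   // DEFAULT_COPYBLOB_BACKOFF_INTERVAL (30s delta)
        90 * 1000,   // DEFAULT_COPYBLOB_MAX_BACKOFF_INTERVAL (90s cap)
        15));        // DEFAULT_COPYBLOB_MAX_RETRY_ATTEMPTS
    // No access conditions; a fresh OperationContext for this call.
    dstBlob.startCopyFromBlob(srcBlob, null, null, copyOptions,
        new OperationContext());
  }
}
{code}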
The backoff is much longer: a 30s delta, capped at 90s per attempt. We set
these values so that it can keep retrying for as long as about 15 minutes. We
apply this only to the rare Azure storage GC case, so we do not lose
performance in the normal case.
I have changed the format.
> Azure Storage FileSystem rename operations are throttled too aggressively to
> complete HBase WAL archiving.
> ----------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11693
> URL: https://issues.apache.org/jira/browse/HADOOP-11693
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Duo Xu
> Assignee: Duo Xu
> Attachments: HADOOP-11681.01.patch, HADOOP-11681.02.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by
> Azure storage when HBase was archiving old WALs. HMaster aborted the region
> server and tried to restart it.
> However, since the cluster was still being throttled by Azure storage, the
> subsequent distributed log splitting also failed. Sometimes the hbase:meta
> table was on this region server and eventually showed as offline, which left
> the whole cluster in a bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044 reported a fatal error:
> ABORTING region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044: IOE in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
>     at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
>     at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
>     at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>     at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>     at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>     at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>     at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>     ... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901, will retry
>     at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
>     at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
>     at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
>     ... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
>     at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>     at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>     at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>     at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
>     at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
>     at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
>     ... 11 more
> Sun Mar 01 18:59:51 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>     at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB performs the rename operation by copying the src
> blob to the destination blob and then deleting the src blob. Copy blob is very
> costly in Azure storage, and during Azure storage GC it is highly likely to be
> throttled. The throttling by Azure storage usually ends within 15 minutes. The
> current WASB retry policy is exponential retry, but it lasts at most about 2
> minutes. The short-term fix is to add a more intensive exponential retry when
> copy blob is throttled.
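> To make the rename path described above concrete, here is a minimal,
> illustrative sketch of the copy-then-delete flow against the
> com.microsoft.windowsazure.storage blob API (an assumption-laden sketch, not
> the WASB implementation or the attached patch):
> {code}
> // Hedged sketch: rename is a server-side copy followed by a delete of the
> // source. The startCopyFromBlob call is the one throttled ("The server is
> // busy") while Azure storage GC is running, so that is where the more
> // patient retry policy matters.
> import com.microsoft.windowsazure.storage.blob.CloudBlockBlob;
> import com.microsoft.windowsazure.storage.blob.CopyStatus;
>
> class RenameSketch {
>   static void rename(CloudBlockBlob src, CloudBlockBlob dst) throws Exception {
>     dst.startCopyFromBlob(src);          // asynchronous server-side copy
>     dst.downloadAttributes();            // refresh copy state
>     while (dst.getCopyState().getStatus() == CopyStatus.PENDING) {
>       Thread.sleep(1000);                // poll until the copy settles
>       dst.downloadAttributes();
>     }
>     if (dst.getCopyState().getStatus() == CopyStatus.SUCCESS) {
>       src.delete();                      // remove the source only on success
>     }
>   }
> }
> {code}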
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)