[
https://issues.apache.org/jira/browse/HADOOP-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968271#comment-14968271
]
Duo Xu commented on HADOOP-11685:
---------------------------------
[~cnauroth]
Yes, the scenario you are talking about is possible. During distributed log
splitting, the general steps are:
1. One node or regionserver crashes.
2. Other nodes pick up the WALs of the crashed node, and on each node a worker
thread starts splitting the WALs it received.
3. The worker thread creates a "recovered.edits" folder under the region whose
edits were recorded in the WAL, if it does not already exist, for example
{noformat}wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits/{noformat}
4. It then writes the edits of the WAL to a temp file under the
"recovered.edits" folder, like
{noformat}wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits/15.temp{noformat}.
In this step each thread writes to a different file, so there are no conflicts.
5. Finally, after it finishes reading that WAL file, it renames the temp file to
{noformat}wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits/15{noformat}
for log replay. In this step it updates the lastmodifiedtime property of the
"recovered.edits" folder. Multiple threads may update the same folder here, but
that was fixed in HADOOP-11523.
6. Go back to step 3.
Since the WALs cover the same range of regions, worker threads on different
nodes may perform steps 3-5 at the same time on the same region folder path.
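To make steps 3-5 concrete, here is a minimal sketch against the Hadoop FileSystem API. The region hash and the sequence number 15 are just the example values from above, and it assumes WASB credentials are configured; this is not the actual HLogSplitter code.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecoveredEditsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative region path taken from the example above.
    Path regionDir = new Path(
        "wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886");
    Path editsDir = new Path(regionDir, "recovered.edits");
    FileSystem fs = regionDir.getFileSystem(conf);

    // Step 3: create "recovered.edits" if it does not exist (the mkdirs that races).
    fs.mkdirs(editsDir);

    // Step 4: each worker writes to its own temp file, so writes do not conflict.
    Path tmp = new Path(editsDir, "15.temp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      out.writeBytes("serialized WAL edits would go here");
    }

    // Step 5: rename the temp file to its final name for log replay.
    fs.rename(tmp, new Path(editsDir, "15"));
  }
}
{code}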
So you think the exception happens when, for example, worker thread 1 is doing
step 5 while worker thread 2 is doing step 3. However, if you look at mkdir in
WASB, it pre-checks whether "recovered.edits" has already been created. So this
case requires that thread 1 finishes steps 3-5 in the window between thread 2
passing the pre-check and thread 2 creating the empty blob in step 3. That
window is very short, but it is possible.
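The pre-check race boils down to the usual check-then-create pattern. Below is a simplified model of it, not the actual NativeAzureFileSystem code: the exists() check and the folder creation are two separate calls against Azure storage, so they are not atomic as a pair.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirPrecheckRaceSketch {

  // Simplified model of WASB's mkdir pre-check; not the real implementation.
  static boolean mkdirWithPrecheck(FileSystem fs, Path recoveredEdits) throws IOException {
    if (fs.exists(recoveredEdits)) {   // pre-check: folder already created?
      return true;                     // yes -> nothing to do
    }
    // <-- race window: another worker that passed the same pre-check, or one
    //     finishing steps 3-5, can create the same folder blob right here.
    return fs.mkdirs(recoveredEdits);  // this is where the "no lease ID"
                                       // StorageException surfaces
  }
}
{code}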
However, a more likely case is that multiple nodes pick up WALs from the
crashed node and, at some point, each finds that the first edits in its WAL
belong to the same region, so they all try to do step 3. Because these are the
first edits for that region in each WAL, there is no "recovered.edits" folder
under that region yet, so the pre-check passes for all worker threads. Then all
of them try to create an empty blob at the same path. The exception happens
when calling
{noformat}openOutputStream(blob).close();{noformat}. Looking into the code,
although WASB does not explicitly acquire a lease on this call, internally,
since this is a write operation, a lease is automatically acquired at the SDK
layer. If worker thread 1 acquires the lease, the other worker threads fail
here.
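For illustration, this is roughly what creating the empty folder blob comes down to at the Azure storage SDK level, matching the commitBlockList/BlobOutputStream.close frames in the stack trace below. The connection string environment variable and the container/blob names are assumptions for the sketch, not the actual WASB code path.
{code}
import java.io.OutputStream;

import com.microsoft.windowsazure.storage.CloudStorageAccount;
import com.microsoft.windowsazure.storage.blob.CloudBlobClient;
import com.microsoft.windowsazure.storage.blob.CloudBlobContainer;
import com.microsoft.windowsazure.storage.blob.CloudBlockBlob;

public class EmptyFolderBlobSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection string source; replace with a real one to run.
    CloudStorageAccount account =
        CloudStorageAccount.parse(System.getenv("AZURE_STORAGE_CONNECTION_STRING"));
    CloudBlobClient client = account.createCloudBlobClient();
    CloudBlobContainer container = client.getContainerReference("hbase");

    // WASB models a folder as a zero-length blob at the folder's path.
    CloudBlockBlob folderBlob = container.getBlockBlobReference(
        "data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits");

    // openOutputStream() only buffers blocks locally; nothing is committed yet.
    OutputStream out = folderBlob.openOutputStream();
    // close() commits the (empty) block list. If a lease is held on this blob
    // at that moment and no lease ID is sent, the service rejects the commit
    // with "There is currently a lease on the blob and no lease ID was
    // specified in the request."
    out.close();
  }
}
{code}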
In either case, this mkdir operation is simply trying to create a
"recovered.edits" folder (or other folders in other scenarios). So if the
exception is "There is currently a lease on the blob...", which means the
folder/blob has already been created, we just need to return success rather
than throw an exception here.
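A sketch of that handling could look like the following. This is just the idea, not the actual HADOOP-11685 patch, and it assumes the storage error code behind that message is "LeaseIdMissing".
{code}
import java.io.IOException;

import com.microsoft.windowsazure.storage.StorageException;
import com.microsoft.windowsazure.storage.blob.CloudBlockBlob;

public class LeaseTolerantFolderCreateSketch {

  // Idea behind the fix: when creating the empty "recovered.edits" folder blob,
  // a lease conflict means another worker already created the folder, so the
  // operation can be treated as a success instead of propagating the exception.
  static void storeEmptyFolder(CloudBlockBlob folderBlob)
      throws IOException, StorageException {
    try {
      folderBlob.openOutputStream().close();
    } catch (IOException | StorageException e) {
      StorageException se = findStorageException(e);
      // "LeaseIdMissing" is, as far as I know, the error code behind
      // "There is currently a lease on the blob and no lease ID was specified".
      if (se != null && "LeaseIdMissing".equals(se.getErrorCode())) {
        return; // folder blob already exists; report success
      }
      throw e;
    }
  }

  // The SDK wraps the StorageException in an IOException on close(), so walk
  // the cause chain to find it.
  private static StorageException findStorageException(Throwable t) {
    while (t != null) {
      if (t instanceof StorageException) {
        return (StorageException) t;
      }
      t = t.getCause();
    }
    return null;
  }
}
{code}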
> StorageException complaining " no lease ID" during HBase distributed log
> splitting
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-11685
> URL: https://issues.apache.org/jira/browse/HADOOP-11685
> Project: Hadoop Common
> Issue Type: Bug
> Components: tools
> Reporter: Duo Xu
> Assignee: Duo Xu
> Attachments: HADOOP-11685.01.patch, HADOOP-11685.02.patch,
> HADOOP-11685.03.patch
>
>
> This is similar to HADOOP-11523, but in a different place. During HBase
> distributed log splitting, multiple threads will access the same folder
> called "recovered.edits". However, lots of places in our WASB code did not
> acquire a lease and simply passed null to Azure Storage, which caused this
> issue.
> {code}
> 2015-02-26 03:21:28,871 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of WALs/workernode4.hbaseproddm2001.g6.internal.cloudapp.net,60020,1422071058425-splitting/workernode4.hbaseproddm2001.g6.internal.cloudapp.net%2C60020%2C1422071058425.1424914216773 failed, returning error
> java.io.IOException: org.apache.hadoop.fs.azure.AzureException: java.io.IOException
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.checkForErrors(HLogSplitter.java:633)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.access$000(HLogSplitter.java:121)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.finishWriting(HLogSplitter.java:964)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(HLogSplitter.java:1019)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:359)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:223)
> 	at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:142)
> 	at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:79)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: java.io.IOException
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.storeEmptyFolder(AzureNativeFileSystemStore.java:1477)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.mkdirs(NativeAzureFileSystem.java:1862)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.mkdirs(NativeAzureFileSystem.java:1812)
> 	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getRegionSplitEditsPath(HLogSplitter.java:502)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.createWAP(HLogSplitter.java:1211)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.getWriterAndPath(HLogSplitter.java:1200)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.append(HLogSplitter.java:1243)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.writeBuffer(HLogSplitter.java:851)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.doRun(HLogSplitter.java:843)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.run(HLogSplitter.java:813)
> Caused by: java.io.IOException
> 	at com.microsoft.windowsazure.storage.core.Utility.initIOException(Utility.java:493)
> 	at com.microsoft.windowsazure.storage.blob.BlobOutputStream.close(BlobOutputStream.java:282)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.storeEmptyFolder(AzureNativeFileSystemStore.java:1472)
> 	... 10 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: There is currently a lease on the blob and no lease ID was specified in the request.
> 	at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> 	at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> 	at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> 	at com.microsoft.windowsazure.storage.blob.CloudBlockBlob.commitBlockList(CloudBlockBlob.java:248)
> 	at com.microsoft.windowsazure.storage.blob.BlobOutputStream.commit(BlobOutputStream.java:319)
> 	at com.microsoft.windowsazure.storage.blob.BlobOutputStream.close(BlobOutputStream.java:279)
> 	... 11 more
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)