[ 
https://issues.apache.org/jira/browse/HADOOP-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968271#comment-14968271
 ] 

Duo Xu commented on HADOOP-11685:
---------------------------------

[~cnauroth]

Yes, the scenario you are describing is possible. During distributed log 
splitting, the general steps are:
1. One node (regionserver) crashes.
2. The other nodes pick up the WALs of the crashed node, and on each of those 
nodes a worker thread starts splitting the WALs it received.
3. The worker thread creates a "recovered.edits" folder under the region whose 
edits were recorded in the WAL, if it does not already exist, for example 
{noformat}wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits/{noformat}
4. It then writes the edits from the WAL to a temp file under the 
"recovered.edits" folder, such as 
{noformat}wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits/15.temp{noformat}.
In this step each thread writes to a different file, so there are no conflicts.
5. Finally, after it has finished reading that WAL file, it renames the temp 
file to 
{noformat}wasb:///hbase/data/default/table/3c842e8823c192d1028dc72ac3f22886/recovered.edits/15{noformat}
for log replay. This step updates the lastModified property of the 
"recovered.edits" folder; multiple threads may update the same folder, but that 
was fixed in HADOOP-11523.
6. Go back to step 3.

Since the WALs cover the same set of regions, the worker threads on different 
nodes may execute steps 3-5 at the same time on the same region folder path.
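
To make the folder/temp-file/rename sequence concrete, here is a minimal 
sketch of the per-worker pattern in steps 3-5 using the Hadoop FileSystem API. 
This is an illustrative simplification, not the actual HLogSplitter code; the 
method name, the seqId parameter, and the placeholder edit bytes are 
assumptions.
{code}
// Minimal sketch (not the actual HLogSplitter code) of the per-worker
// pattern in steps 3-5; regionDir, seqId and the placeholder edit bytes
// are illustrative assumptions.
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecoveredEditsSketch {
  static void writeRecoveredEdits(FileSystem fs, Path regionDir, long seqId) throws Exception {
    Path editsDir = new Path(regionDir, "recovered.edits");
    fs.mkdirs(editsDir);                              // step 3: create the folder if it does not exist

    Path temp = new Path(editsDir, seqId + ".temp");  // step 4: each worker writes its own temp file
    try (FSDataOutputStream out = fs.create(temp, true)) {
      out.writeBytes("...edits from the WAL...");     // placeholder for the real WAL edits
    }

    fs.rename(temp, new Path(editsDir, Long.toString(seqId)));  // step 5: rename for log replay
  }
}
{code}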

So you think the exception happens when, for example, worker thread 1 is doing 
step 5 while worker thread 2 is doing step 3. However, if you look at mkdir in 
WASB, it pre-checks whether "recovered.edits" has already been created. For 
that case to occur, thread 1 would have to finish steps 3-5 in the window 
between thread 2 passing the pre-check and actually creating the empty blob in 
step 3. That window is very short, but it is possible.
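
In other words, this is a classic check-then-act race. The sketch below is a 
simplified, hypothetical view of the WASB mkdir path; exists() and 
storeEmptyFolder() are illustrative stand-ins for the real 
AzureNativeFileSystemStore calls, not its exact code.
{code}
// Simplified, hypothetical view of the check-then-create window in the WASB
// mkdir path; exists() and storeEmptyFolder() are illustrative stand-ins.
import java.io.IOException;

abstract class FolderCreationSketch {
  abstract boolean exists(String folderKey) throws IOException;
  abstract void storeEmptyFolder(String folderKey) throws IOException;

  boolean mkdirs(String folderKey) throws IOException {
    if (exists(folderKey)) {      // pre-check: thread 2 still sees no folder here
      return true;
    }
    // <-- window: thread 1 can complete steps 3-5 (creating the folder blob
    //     and holding a lease on it) between the check above and the call below
    storeEmptyFolder(folderKey);  // thread 2's create now fails with the lease error
    return true;
  }
}
{code}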

However, a more likely case is when multiple nodes pick up WALs from the 
crashed node and, at some point, each of them encounters the first edits in 
its WAL that belong to the same region, so they all try to do step 3. Because 
these edits are the first ones for that region in each WAL, there is no 
"recovered.edits" folder under that region yet. The pre-check therefore passes 
for all of the worker threads, and all of them try to create an empty blob on 
the same path. The exception happens when calling 
{noformat}openOutputStream(blob).close();{noformat}. Looking into the code, 
although WASB does not explicitly acquire a lease on this call, it is a write 
operation, so a lease is acquired automatically in the SDK layer. If worker 
thread 1 acquires the lease, the other worker threads fail here.
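
As a rough illustration of where the failure surfaces, the sketch below 
mirrors the {noformat}openOutputStream(blob).close(){noformat} call using the 
SDK classes from the stack trace; the wrapper method is hypothetical, not the 
exact AzureNativeFileSystemStore.storeEmptyFolder() code.
{code}
// Rough illustration of the failure point, using the SDK classes from the
// stack trace; the wrapper method is hypothetical.
import java.io.IOException;
import com.microsoft.windowsazure.storage.StorageException;
import com.microsoft.windowsazure.storage.blob.CloudBlockBlob;

class EmptyBlobCreateSketch {
  static void createEmptyFolderBlob(CloudBlockBlob folderBlob) throws IOException, StorageException {
    // close() commits the (empty) block list; if another worker has already
    // created this blob and holds a lease on it, the commit is rejected and
    // close() throws an IOException wrapping a StorageException with the
    // message "There is currently a lease on the blob and no lease ID was
    // specified in the request."
    folderBlob.openOutputStream().close();
  }
}
{code}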

In either case, this mkdir operation is simply trying to create the 
"recovered.edits" folder (or other folders in other scenarios). If the 
exception is "There is currently a lease on the blob...", the folder/blob has 
already been created, so we just need to return success here rather than throw 
an exception.
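
A minimal sketch of that tolerant behavior follows, assuming the lease 
conflict can be identified from the wrapped StorageException (the 
"LeaseIdMissing" error code is an assumption based on the message in the stack 
trace); the method and helper names are illustrative, not the exact patch.
{code}
// Minimal sketch of the proposed handling, not the exact patch: if creating
// the folder blob fails only because another worker already created it and
// holds a lease on it, treat mkdir as successful.
import java.io.IOException;
import com.microsoft.windowsazure.storage.StorageException;

class LeaseTolerantMkdirSketch {
  interface EmptyFolderStore {
    void storeEmptyFolder(String folderKey) throws IOException;
  }

  static boolean mkdirs(EmptyFolderStore store, String folderKey) throws IOException {
    try {
      store.storeEmptyFolder(folderKey);
    } catch (IOException e) {
      if (isLeaseConflict(e)) {
        return true;  // the folder blob already exists; report success instead of failing
      }
      throw e;
    }
    return true;
  }

  static boolean isLeaseConflict(IOException e) {
    Throwable cause = e.getCause();
    // "LeaseIdMissing" is assumed from the message in the stack trace:
    // "There is currently a lease on the blob and no lease ID was specified..."
    return cause instanceof StorageException
        && "LeaseIdMissing".equals(((StorageException) cause).getErrorCode());
  }
}
{code}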

> StorageException complaining " no lease ID" during HBase distributed log 
> splitting
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-11685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11685
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>            Reporter: Duo Xu
>            Assignee: Duo Xu
>         Attachments: HADOOP-11685.01.patch, HADOOP-11685.02.patch, 
> HADOOP-11685.03.patch
>
>
> This is similar to HADOOP-11523, but in a different place. During HBase 
> distributed log splitting, multiple threads access the same folder called 
> "recovered.edits". However, many places in our WASB code do not acquire a 
> lease and simply pass null to Azure storage, which causes this issue.
> {code}
> 2015-02-26 03:21:28,871 WARN 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of 
> WALs/workernode4.hbaseproddm2001.g6.internal.cloudapp.net,60020,1422071058425-splitting/workernode4.hbaseproddm2001.g6.internal.cloudapp.net%2C60020%2C1422071058425.1424914216773
>  failed, returning error
> java.io.IOException: org.apache.hadoop.fs.azure.AzureException: 
> java.io.IOException
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.checkForErrors(HLogSplitter.java:633)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.access$000(HLogSplitter.java:121)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$OutputSink.finishWriting(HLogSplitter.java:964)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(HLogSplitter.java:1019)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:359)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:223)
>       at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:142)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:79)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: java.io.IOException
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.storeEmptyFolder(AzureNativeFileSystemStore.java:1477)
>       at 
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.mkdirs(NativeAzureFileSystem.java:1862)
>       at 
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.mkdirs(NativeAzureFileSystem.java:1812)
>       at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getRegionSplitEditsPath(HLogSplitter.java:502)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.createWAP(HLogSplitter.java:1211)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.getWriterAndPath(HLogSplitter.java:1200)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.append(HLogSplitter.java:1243)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.writeBuffer(HLogSplitter.java:851)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.doRun(HLogSplitter.java:843)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.run(HLogSplitter.java:813)
> Caused by: java.io.IOException
>       at 
> com.microsoft.windowsazure.storage.core.Utility.initIOException(Utility.java:493)
>       at 
> com.microsoft.windowsazure.storage.blob.BlobOutputStream.close(BlobOutputStream.java:282)
>       at 
> org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.storeEmptyFolder(AzureNativeFileSystemStore.java:1472)
>       ... 10 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: There is 
> currently a lease on the blob and no lease ID was specified in the request.
>       at 
> com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
>       at 
> com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
>       at 
> com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
>       at 
> com.microsoft.windowsazure.storage.blob.CloudBlockBlob.commitBlockList(CloudBlockBlob.java:248)
>       at 
> com.microsoft.windowsazure.storage.blob.BlobOutputStream.commit(BlobOutputStream.java:319)
>       at 
> com.microsoft.windowsazure.storage.blob.BlobOutputStream.close(BlobOutputStream.java:279)
>       ... 11 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
