[ https://issues.apache.org/jira/browse/HADOOP-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Xu updated HADOOP-12403:
----------------------------
    Description: 
Azure HDInsight (HDI) HBase clusters use Azure Blob storage as the file system. We 
found that the bottleneck was writing to the write-ahead log (WAL). The latest 
HBase WAL write model (HBASE-8755) uses multiple AsyncSyncer threads to sync data 
to HDFS. However, our WASB driver is still based on a single-threaded model, so 
when the sync threads call into the WASB layer, only one thread at a time is 
allowed to send data to Azure Storage. This jira introduces a new write model in 
the WASB layer that allows multiple writes in parallel.

1. Since we use a page blob for the WAL, parallel writes will leave "holes" in the 
page blob, because every write starts on a new page. We use the first two bytes of 
every page to record the actual data size of that page (sketched below).

2. When reading the WAL, we need to know its actual size, which is the sum of the 
values stored in the first two bytes of every page. Looping over every page to 
compute this would be very slow, so during writing, every time a write succeeds we 
update a blob metadata entry called "total_data_uploaded" (sketched below).

3. Although we allow multiple writes in flight, we need to make sure the sync 
threads that call into the WASB layer return in order. Reading the HBase source 
code (FSHLog.java), we found that every sync request is associated with a 
transaction id; if the sync succeeds, all transactions up to that id are assumed 
to be in Azure Storage. We use a queue to store the sync requests and make sure 
they return to the HBase layer in order (sketched below).
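
Below is a minimal, illustrative sketch (not the actual patch) of the per-page 
layout described in item 1, assuming the standard 512-byte Azure page size; the 
class and method names are hypothetical. The first two bytes of each page carry 
the number of valid data bytes, so a reader can skip the zero padding ("hole") at 
the end of each page.

{code:java}
import java.nio.ByteBuffer;

/**
 * Illustrative only: encodes/decodes a 512-byte Azure page whose first two
 * bytes store the length of the valid data in the rest of the page.
 */
public final class PageCodec {
  // Azure page blobs are written in 512-byte pages.
  public static final int PAGE_SIZE = 512;
  public static final int HEADER_SIZE = 2;                 // two-byte length prefix
  public static final int MAX_DATA_PER_PAGE = PAGE_SIZE - HEADER_SIZE;

  /** Pack up to MAX_DATA_PER_PAGE bytes into one page-sized buffer. */
  public static byte[] encodePage(byte[] data, int off, int len) {
    if (len > MAX_DATA_PER_PAGE) {
      throw new IllegalArgumentException("data does not fit in one page");
    }
    ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
    page.putShort((short) len);          // first two bytes = actual data size
    page.put(data, off, len);            // bytes after 'len' stay zero (the "hole")
    return page.array();
  }

  /** Return only the valid data bytes of a page, dropping the padding. */
  public static byte[] decodePage(byte[] page) {
    ByteBuffer buf = ByteBuffer.wrap(page);
    int len = buf.getShort() & 0xFFFF;   // read the length prefix as unsigned
    byte[] data = new byte[len];
    buf.get(data);
    return data;
  }
}
{code}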
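
A sketch of the bookkeeping in item 2, assuming the Microsoft Azure Storage SDK 
for Java that WASB builds on (CloudPageBlob metadata calls); the surrounding class 
and method names are hypothetical, only the "total_data_uploaded" key comes from 
the design above.

{code:java}
import java.util.concurrent.atomic.AtomicLong;
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudPageBlob;

/**
 * Illustrative only: after each successful page upload, record the running
 * total of real (un-padded) bytes in the blob's metadata so a reader can
 * learn the WAL size without scanning every page header.
 */
public class WalSizeTracker {
  public static final String TOTAL_DATA_UPLOADED = "total_data_uploaded";

  private final CloudPageBlob blob;
  private final AtomicLong totalDataUploaded = new AtomicLong(0);

  public WalSizeTracker(CloudPageBlob blob) {
    this.blob = blob;
  }

  /** Call once a write of 'dataLen' real bytes has succeeded. */
  public void onWriteSucceeded(int dataLen) throws StorageException {
    long total = totalDataUploaded.addAndGet(dataLen);
    blob.getMetadata().put(TOTAL_DATA_UPLOADED, Long.toString(total));
    blob.uploadMetadata();               // persist the metadata on the service
  }

  /** Reader side: fetch the recorded WAL size instead of summing page headers. */
  public static long readTotalSize(CloudPageBlob blob) throws StorageException {
    blob.downloadAttributes();           // refresh metadata from the service
    String v = blob.getMetadata().get(TOTAL_DATA_UPLOADED);
    return v == null ? 0L : Long.parseLong(v);
  }
}
{code}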
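
A sketch of the ordering rule in item 3; the queue and its method names are 
hypothetical, but the behavior matches the description: uploads may finish in any 
order, while sync requests are acknowledged back to FSHLog strictly in submission 
(transaction id) order.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/**
 * Illustrative only: acknowledges sync requests in the order they were
 * submitted, even if the underlying parallel uploads finish out of order.
 */
public class OrderedSyncQueue {

  private static final class PendingSync {
    final long txid;
    final CompletableFuture<Void> done = new CompletableFuture<>();
    volatile boolean uploaded;
    PendingSync(long txid) { this.txid = txid; }
  }

  private final Queue<PendingSync> pending = new ArrayDeque<>();

  /** Called by a sync thread entering the WASB layer; returns a future to block on. */
  public synchronized CompletableFuture<Void> register(long txid) {
    PendingSync s = new PendingSync(txid);
    pending.add(s);                       // FIFO: submission order == txid order
    return s.done;
  }

  /** Called by an upload thread when the write for 'txid' has reached Azure Storage. */
  public synchronized void onUploadComplete(long txid) {
    for (PendingSync s : pending) {
      if (s.txid == txid) {
        s.uploaded = true;
        break;
      }
    }
    // Release from the head only: a sync returns when it and every earlier
    // transaction are durable, matching the FSHLog contract described above.
    while (!pending.isEmpty() && pending.peek().uploaded) {
      pending.poll().done.complete(null);
    }
  }
}
{code}

In this sketch, a sync thread entering the WASB layer would block on the future 
returned by register(txid) and only then return to the HBase layer.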



> Enable multiple writes in flight for HBase WAL writing
> ------------------------------------------------------
>
>                 Key: HADOOP-12403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12403
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools
>            Reporter: Duo Xu
>            Assignee: Duo Xu
>


