[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037472#comment-15037472
 ] 

Duo Zhang commented on HBASE-14790:
-----------------------------------

Oh, I think we cannot fix HBASE-14004 without changing the replication 
module of HBase. No matter how we implement DFSOutputStream, consider this 
scenario:

1. The RS flushes a WAL entry to dn1, dn2 and dn3.
2. dn1 receives the WAL entry, and it is read by ReplicationSource and 
replicated to the slave cluster.
3. dn1 and the RS both crash; dn2 and dn3 have not received this WAL entry 
yet, and the RS has not bumped the GS of this block yet.
4. The NameNode completes the file with a length that does not contain this 
WAL entry, since the GS of the blocks on dn2 and dn3 is correct and the 
NameNode does not know there used to be a block with a longer length.
5. whoops...
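To make the race concrete, here is a tiny sketch with made-up offsets; nothing here is an actual HBase or HDFS API, it only models the bookkeeping. The slave cluster ends up holding bytes that the completed source file no longer has:

```java
// Sketch of the race above with hypothetical offsets; all names and numbers
// are illustrative, not HBase/HDFS APIs.
public class WalRaceSketch {
    // Bytes the slave cluster has that the completed source file does not.
    static long lostBytes(long replicatedOffset, long completedLength) {
        return Math.max(0, replicatedOffset - completedLength);
    }

    public static void main(String[] args) {
        long replicatedOffset = 1100; // ReplicationSource read dn1 up to here
        long completedLength = 1000;  // NameNode completed the file here
        // The slave cluster now holds 100 bytes the source file lost.
        System.out.println(lostBytes(replicatedOffset, completedLength));
    }
}
```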

So I think every RS should keep an "acked length" for the WAL file it is 
currently writing, and when doing replication, ReplicationSource should ask 
for this length first and never read beyond it. If we have this logic, then 
the implementation of the new "DFSOutputStream" is much simpler. If writing 
the WAL fails on some datanode, we could just truncate the file to our "acked 
length" and fail all the entries after the "acked length". This keeps 
everything consistent.
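A minimal sketch of that bookkeeping, assuming a single writer thread for appends; the class and method names are hypothetical, not actual HBase APIs:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the proposed "acked length" bookkeeping.
public class AckedLengthWal {
    // Length acknowledged by all datanodes in the pipeline; read by
    // ReplicationSource, so published through an AtomicLong.
    private final AtomicLong ackedLength = new AtomicLong(0);
    // Length handed to the output stream but not yet acked.
    private long writtenLength = 0;

    // Called after appending a WAL entry to the stream.
    public void append(int entryBytes) {
        writtenLength += entryBytes;
    }

    // Called once every datanode has acked up to the written length.
    public void sync() {
        ackedLength.set(writtenLength);
    }

    // ReplicationSource asks for this first and never reads past it.
    public long getAckedLength() {
        return ackedLength.get();
    }

    // On a write failure: roll back to the acked length and fail the entries
    // beyond it, instead of attempting pipeline recovery. Returns the offset
    // the file should be truncated to.
    public long onWriteFailure() {
        writtenLength = ackedLength.get();
        return writtenLength;
    }
}
```

Since replication never reads past the acked length, truncating back to it on failure discards only bytes that no reader has observed.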

Thanks.

> Implement a new DFSOutputStream for logging WAL only
> ----------------------------------------------------
>
>                 Key: HBASE-14790
>                 URL: https://issues.apache.org/jira/browse/HBASE-14790
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency (such as HBASE-14004) when 
> using the original DFSOutputStream due to its complicated logic. And the 
> complicated logic also forces us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5, and not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
