[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041185#comment-15041185 ]

stack commented on HBASE-14790:
-------------------------------

bq. ReplicationSource should ask this length first before reading and do not 
read beyond it. If we have this logic, 

Doing this would be an improvement over the current way we do replication -- 
fewer NN ops -- where we open the file, read till EOF, close, then do the same 
again to see if anything new has been added to the file.
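
Roughly something like this (just a sketch, not the actual ReplicationSource 
code; readAndShipUpTo is a made-up helper standing in for the existing entry 
reader/shipper):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

long shipUpToKnownLength(Configuration conf, Path walPath) throws Exception {
  FileSystem fs = FileSystem.get(conf);
  long knownLen = fs.getFileStatus(walPath).getLen(); // one NN op to get the length
  try (FSDataInputStream in = fs.open(walPath)) {
    while (in.getPos() < knownLen) {
      readAndShipUpTo(in, knownLen); // read/ship the next entry, never past knownLen
    }
  }
  return knownLen; // remember this as the shipped position
}
{code}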

bq. ...we could reset the acked length if needed and then move the remaining 
operations of closing file to a background thread to reduce latency. Thoughts? 
stack

This is clean up of a broken WAL? This is being able to ask each DN what it 
thinks the length is? While this is going on, we would be holding on to the 
hbase handlers, not letting the response go back to the client? Would we have 
to do some weird accounting where three clients A, B, and C have each written 
an edit, and then the length we get back from the existing DNs after a crash, 
say, does not include the edit written by client C... we'd have to figure out 
how to fail client C's write (though we'd moved on from the append and were 
trying to sync/hflush it)?

bq. We could just rewrite the WAL entries after acked point to the new file, 
this could also reduce the recovery latency.

I think we can do this currently in the multi WAL case... would have to check 
(or at least one implementation, maybe not the one that landed, used to do 
this). It kept the outstanding edits around because it had a standby WAL: if 
the current WAL was 'slow', we'd throw it away, add the outstanding edits to 
the new WAL, and away we go again (I can dig it up...).
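
Very roughly it went something like this (a sketch only, not the implementation 
that landed; pendingEdits, currentWriter and standbyWriter are made-up names):

{code:java}
if (currentWalLooksSlowOrBroken()) {
  List<WAL.Entry> unacked = pendingEdits.afterAckedPoint(); // edits past the acked point
  closeQuietly(currentWriter);     // throw the slow/broken WAL away, don't wait on it
  currentWriter = standbyWriter;   // switch to the already-open standby WAL
  for (WAL.Entry e : unacked) {
    currentWriter.append(e);       // rewrite the outstanding edits to the new file
  }
  currentWriter.sync();            // and away we go again
}
{code}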

bq. The replication source should not read beyond the length gotten from 
namenode(do not trust the visible length read from datanode). 

This would be lots of NN ops? (In a subsequent comment you say this... nvm)

bq. The advantage here is when region server crashes, we could still get this 
value from namenode, and the file will be closed eventually by someone so the 
length will finally be correct.

This would be sweet though (could do away with keeping replication lengths up 
in zk?)
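
i.e. something like the below (a sketch, assuming the WAL is on HDFS and 
somebody kicks lease recovery so the file eventually gets closed with a 
correct length):

{code:java}
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
boolean closed = dfs.recoverLease(walPath); // no-op if the file is already closed
if (closed) {
  long finalLen = dfs.getFileStatus(walPath).getLen();
  // replication could use finalLen as the hard limit for this WAL instead of
  // an acked length kept up in zk
}
{code}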

bq. There will always be some situation that we could not know there is data 
loss unless we call fsync every time to update length on namenode when writing 
WAL I think. 

Yes. This is the case before your patch though. We should also get some 
experience of what it's like trying an fsync'd WAL...
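
For reference, the distinction in play (plain o.a.h.fs API; walEntryBytes is 
just a placeholder for a serialized edit):

{code:java}
FSDataOutputStream out = fs.create(walPath);
out.write(walEntryBytes);
out.hflush(); // pushed down the DN pipeline and visible to readers, but maybe only in DN memory
out.hsync();  // DNs fsync to disk as well -- durable, at a latency cost
{code}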


> Implement a new DFSOutputStream for logging WAL only
> ----------------------------------------------------
>
>                 Key: HBASE-14790
>                 URL: https://issues.apache.org/jira/browse/HBASE-14790
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency (such as HBASE-14004) when 
> using the original DFSOutputStream due to its complicated logic. And the 
> complicated logic also forces us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 and not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for logging WAL. 
> For correctness, and also for performance.


