[ 
https://issues.apache.org/jira/browse/HBASE-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Honghua updated HBASE-10329:
---------------------------------

    Attachment: HBASE-10329-trunk_v0.patch

patch is attached and ping [~stack] :-)

> Fail the writes rather than proceeding silently to prevent data loss when 
> AsyncSyncer encounters null writer and its writes aren't synced by other 
> Asyncer
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10329
>                 URL: https://issues.apache.org/jira/browse/HBASE-10329
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>            Reporter: Feng Honghua
>            Assignee: Feng Honghua
>         Attachments: HBASE-10329-trunk_v0.patch
>
>
> Last month after I introduced multiple AsyncSyncer threads to improve the 
> throughput for lower number clients, [~stack] encountered a NPE while doing 
> the test where null writer occues in AsyncSyncer when doing sync. Since we 
> have run many times test in cluster to verify the throughput improvement, and 
> never encountered such NPE, it really confused me. (and [~stack] fix this by 
> adding a 'if (writer != null)' to protect the sync operation)
> I always wonder why the write can be null in AsyncSyncer and whether it's 
> safe to fix by just adding a null check before doing sync, as [~stack] did. 
> After some dig and analysis, I find out the case where AsyncSyncer can 
> encounter null writer, it is as below:
> 1. t1: AsyncWriter append writes to hdfs, triggers AsyncSyncer 1 with 
> writtenTxid==100
> 2. t2: AsyncWriter append writes to hdfs, triggers AsyncSyncer 2 with 
> writtenTxid==200
> 3. t3: rollWriter starts, it grabs the updateLock which prevents further 
> writes entering pendingWrites, and then waits for all items(till 200) in 
> pendingWrites to append and finally sync to hdfs
> 4. t4: AsyncSyncer 2 finishes, now syncedTillHere==200, it also help sync 
> <=100 as a whole
> 5. t5: rollWriter close writer, set writer=null...
> 6. t6: AsyncSyncer 1 starts to do sync and finds writer is null... before 
> rollWriter set writer to the newly created Writer
> We can see:
> 1. the null writer is possible only after there are multiple AsyncSyncer 
> threads, that's why we never encountered it before introducing multiple 
> AsyncSyncer threads.
> 2. since rollWriter can set writer=null only after all items of pendingWrites 
> sync to hdfs, and AsyncWriter is in the critical path of this task and there 
> is only one AsyncWriter thread, so AsyncWriter can't encounter null writer, 
> that's why we never encounter null writer in AsyncWriter though it uses 
> writer as well. This is the same reason as why null writer never occurs when 
> there is a single AsyncSyncer thread.
> And we should treat differently when writer == null in AsyncSyncer:
> 1. if txidToSync <= syncedTillHere, this means all writes this AsyncSyncer 
> care about have already by synced by other AsyncSyncer, we can safely ignore 
> sync(as [~stack] does here);
> 2. if txidToSync > syncedTillHere, we need fail all the writes with txid <= 
> txidToSync to avoid data loss: user get successful write response but can't 
> read out the writes, from user's perspective this is data loss (according to 
> above analysis, such case should not occur, but we still should add such 
> defensive treatment to prevent data loss if it really occurs, such as by some 
> bug introduced later)
> also fix the bug where isSyncing needs to reset to false when writer.sync 
> encounters IOException: AsyncSyncer swallows such exception by failing all 
> writes with txid<=txidToSync, and this AsyncSyncer thread is now ready to do 
> later sync, its isSyncing needs to be reset to false in the IOException 
> handling block



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to