[
https://issues.apache.org/jira/browse/HBASE-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell resolved HBASE-10329.
------------------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
Green builds, resolved.
> Fail the writes rather than proceeding silently to prevent data loss when
> AsyncSyncer encounters null writer and its writes aren't synced by other
> AsyncSyncer
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-10329
> URL: https://issues.apache.org/jira/browse/HBASE-10329
> Project: HBase
> Issue Type: Bug
> Components: regionserver, wal
> Affects Versions: 0.98.0
> Reporter: Feng Honghua
> Assignee: Feng Honghua
> Priority: Critical
> Fix For: 0.98.0, 0.99.0
>
> Attachments: 10329-0.98.txt, HBASE-10329-trunk_v0.patch
>
>
> Last month, after I introduced multiple AsyncSyncer threads to improve
> throughput with a lower number of client write threads, [~stack] hit an NPE
> during testing: AsyncSyncer found a null writer when doing a sync. Since we
> had run the test many times in a cluster to verify the throughput
> improvement and never encountered such an NPE, it really confused me.
> ([~stack] fixed this by adding 'if (writer != null)' to protect the sync
> operation.)
> Since then I have wondered from time to time why the writer can be null in
> AsyncSyncer, and whether it is safe to fix it by just adding a null check
> before the sync, as [~stack] did. After some digging, I found the case
> where AsyncSyncer can encounter a null writer. It is as follows:
> 1. t1: AsyncWriter appends writes to hdfs, triggers AsyncSyncer 1 with
> writtenTxid==100
> 2. t2: AsyncWriter appends writes to hdfs, triggers AsyncSyncer 2 with
> writtenTxid==200
> 3. t3: rollWriter starts; it grabs the updateLock to prevent further client
> writes from entering pendingWrites, and then waits for all items (<= 200)
> in pendingWrites to be appended and finally synced to hdfs
> 4. t4: AsyncSyncer 2 finishes; now syncedTillHere==200 (it also helps sync
> <=100 as a whole)
> 5. t5: rollWriter can now close the writer and set writer=null...
> 6. t6: AsyncSyncer 1 starts to sync and finds the writer is null... before
> rollWriter sets writer to the newly rolled Writer
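The interleaving above can be replayed deterministically. The following is a minimal single-threaded sketch, not HBase code: the class RollRaceReplay, its nested Writer stand-in, and the sync/replay methods are hypothetical names introduced only for illustration. It shows that by the time AsyncSyncer 1 runs at t6, the writer is already null while syncedTillHere has reached 200.

```java
// Hypothetical, single-threaded replay of steps t1..t6; not actual HBase code.
final class RollRaceReplay {
    // Stand-in for the WAL writer; only whether it is null matters here.
    static final class Writer {}

    Writer writer = new Writer();
    long syncedTillHere = 0;

    /** One AsyncSyncer attempt; returns false if it found the writer null. */
    boolean sync(long txidToSync) {
        if (writer == null) {
            return false; // the situation hit at t6
        }
        syncedTillHere = Math.max(syncedTillHere, txidToSync);
        return true;
    }

    /** Replays the interleaving and reports what AsyncSyncer 1 observes at t6. */
    static String replay() {
        RollRaceReplay log = new RollRaceReplay();
        long syncer1Txid = 100;             // t1: AsyncSyncer 1 triggered
        long syncer2Txid = 200;             // t2: AsyncSyncer 2 triggered
        // t3: rollWriter holds updateLock and waits for txid <= 200 to sync
        log.sync(syncer2Txid);              // t4: AsyncSyncer 2 syncs, covering <= 100 too
        log.writer = null;                  // t5: rollWriter closes the old writer
        boolean ok = log.sync(syncer1Txid); // t6: AsyncSyncer 1 finally runs
        return "writerWasNull=" + !ok + ", syncedTillHere=" + log.syncedTillHere;
    }
}
```

Calling RollRaceReplay.replay() yields "writerWasNull=true, syncedTillHere=200": the null writer is observed, yet nothing AsyncSyncer 1 was responsible for is unsynced.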
> We can see:
> 1. A null writer is possible only when there are multiple AsyncSyncer
> threads; that's why we never encountered it before introducing multiple
> AsyncSyncer threads.
> 2. rollWriter can set writer=null only after all items of pendingWrites
> have been synced to hdfs. AsyncWriter is in the critical path of that task
> and there is only a single AsyncWriter thread, so AsyncWriter can't
> encounter a null writer; that's why we never see a null writer in
> AsyncWriter even though it also uses writer. For the same reason, a null
> writer never occurs when there is a single AsyncSyncer thread.
> And we should treat writer == null in AsyncSyncer differently depending on
> the txid:
> 1. if txidToSync <= syncedTillHere, all writes this AsyncSyncer cares about
> have already been synced by another AsyncSyncer, so we can safely skip the
> sync (as [~stack] does here);
> 2. if txidToSync > syncedTillHere, we need to fail all the writes with
> txid <= txidToSync to avoid data loss: otherwise the user gets a successful
> write response but can't read back the writes afterwards, which from the
> user's perspective is data loss. (According to the above analysis, this case
> should not occur, but we should still add this defensive treatment to
> prevent data loss if it ever does, e.g. because of a bug introduced later.)
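The two cases above reduce to a single comparison. A minimal sketch follows, assuming a hypothetical NullWriterPolicy helper and Action enum that are not part of the attached patch; only txidToSync and syncedTillHere come from the description.

```java
// Hypothetical helper illustrating the two cases; not taken from the patch.
final class NullWriterPolicy {
    enum Action { SKIP_SYNC, FAIL_WRITES }

    /** Decides what an AsyncSyncer should do when it finds writer == null. */
    static Action onNullWriter(long txidToSync, long syncedTillHere) {
        // Case 1: another AsyncSyncer already covered everything up to txidToSync.
        // Case 2: some writes are unsynced, so they must be failed, not silently dropped.
        return txidToSync <= syncedTillHere ? Action.SKIP_SYNC : Action.FAIL_WRITES;
    }
}
```

With the numbers from the timeline, onNullWriter(100, 200) gives SKIP_SYNC, while a hypothetical onNullWriter(300, 200) gives FAIL_WRITES.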
> Also fix the bug where isSyncing is not reset to false when writer.sync
> encounters an IOException: AsyncSyncer swallows the exception by failing all
> writes with txid<=txidToSync, and the AsyncSyncer thread is then ready to do
> a later sync, so its isSyncing needs to be reset to false in the IOException
> handling block; otherwise it can't be selected by AsyncWriter to do a sync.
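The isSyncing fix can be sketched as follows. This is an illustration under assumptions, not the patch itself: SyncerSketch, its Writer interface, and failedUpToTxid are hypothetical names, and the sketch resets the flag in a finally block, which covers the IOException path the description mentions as well as the success path.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the flag handling; not the actual AsyncSyncer class.
final class SyncerSketch {
    /** Stand-in for the WAL writer's sync call. */
    interface Writer { void sync() throws IOException; }

    private final AtomicBoolean isSyncing = new AtomicBoolean(false);
    private long failedUpToTxid = -1; // highest txid whose writes were failed, -1 if none

    boolean isSyncing() { return isSyncing.get(); }
    long failedUpToTxid() { return failedUpToTxid; }

    void syncOnce(Writer writer, long txidToSync) {
        isSyncing.set(true);
        try {
            writer.sync();
        } catch (IOException e) {
            // The exception is swallowed: writes with txid <= txidToSync are failed instead.
            failedUpToTxid = txidToSync;
        } finally {
            // Without this reset on the failure path, the syncer would look busy forever
            // and could never be selected for a later sync.
            isSyncing.set(false);
        }
    }
}
```

After a failing call such as syncOnce(() -> { throw new IOException("boom"); }, 100), isSyncing() is false again and failedUpToTxid() is 100, so the syncer remains selectable.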
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)