[
https://issues.apache.org/jira/browse/HBASE-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Feng Honghua updated HBASE-10329:
---------------------------------
Attachment: HBASE-10329-trunk_v0.patch
patch is attached and ping [~stack] :-)
> Fail the writes rather than proceeding silently to prevent data loss when
> AsyncSyncer encounters null writer and its writes aren't synced by other
> Asyncer
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-10329
> URL: https://issues.apache.org/jira/browse/HBASE-10329
> Project: HBase
> Issue Type: Bug
> Components: regionserver, wal
> Reporter: Feng Honghua
> Assignee: Feng Honghua
> Attachments: HBASE-10329-trunk_v0.patch
>
>
> Last month after I introduced multiple AsyncSyncer threads to improve the
> throughput for lower number clients, [~stack] encountered a NPE while doing
> the test where null writer occues in AsyncSyncer when doing sync. Since we
> have run many times test in cluster to verify the throughput improvement, and
> never encountered such NPE, it really confused me. (and [~stack] fix this by
> adding a 'if (writer != null)' to protect the sync operation)
> I always wonder why the write can be null in AsyncSyncer and whether it's
> safe to fix by just adding a null check before doing sync, as [~stack] did.
> After some dig and analysis, I find out the case where AsyncSyncer can
> encounter null writer, it is as below:
> 1. t1: AsyncWriter append writes to hdfs, triggers AsyncSyncer 1 with
> writtenTxid==100
> 2. t2: AsyncWriter append writes to hdfs, triggers AsyncSyncer 2 with
> writtenTxid==200
> 3. t3: rollWriter starts, it grabs the updateLock which prevents further
> writes entering pendingWrites, and then waits for all items(till 200) in
> pendingWrites to append and finally sync to hdfs
> 4. t4: AsyncSyncer 2 finishes, now syncedTillHere==200, it also help sync
> <=100 as a whole
> 5. t5: rollWriter close writer, set writer=null...
> 6. t6: AsyncSyncer 1 starts to do sync and finds writer is null... before
> rollWriter set writer to the newly created Writer
> We can see:
> 1. the null writer is possible only after there are multiple AsyncSyncer
> threads, that's why we never encountered it before introducing multiple
> AsyncSyncer threads.
> 2. since rollWriter can set writer=null only after all items of pendingWrites
> sync to hdfs, and AsyncWriter is in the critical path of this task and there
> is only one AsyncWriter thread, so AsyncWriter can't encounter null writer,
> that's why we never encounter null writer in AsyncWriter though it uses
> writer as well. This is the same reason as why null writer never occurs when
> there is a single AsyncSyncer thread.
> And we should treat differently when writer == null in AsyncSyncer:
> 1. if txidToSync <= syncedTillHere, this means all writes this AsyncSyncer
> care about have already by synced by other AsyncSyncer, we can safely ignore
> sync(as [~stack] does here);
> 2. if txidToSync > syncedTillHere, we need fail all the writes with txid <=
> txidToSync to avoid data loss: user get successful write response but can't
> read out the writes, from user's perspective this is data loss (according to
> above analysis, such case should not occur, but we still should add such
> defensive treatment to prevent data loss if it really occurs, such as by some
> bug introduced later)
> also fix the bug where isSyncing needs to reset to false when writer.sync
> encounters IOException: AsyncSyncer swallows such exception by failing all
> writes with txid<=txidToSync, and this AsyncSyncer thread is now ready to do
> later sync, its isSyncing needs to be reset to false in the IOException
> handling block
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)