[
https://issues.apache.org/jira/browse/BOOKKEEPER-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266930#comment-15266930
]
Matteo Merli commented on BOOKKEEPER-926:
-----------------------------------------
[~jujjuri] Yes, that looks like the same issue. If the
{{leastUnflushedEntryLogId}} moves forward (eg: entries are being written,
entry log is full and new one is created), then the bookie GC thread is
essentially stuck, waiting for an older entry log to be flushed, but that entry
log was already flushed.
[~eolivelli] Yes we've seen it happening in production. I mean we've see the
bookie GC getting stuck, in the same place that [~jujjuri] pasted above.
About the other issue, of the compacted entries not being flushed on disk, it's
hard to tell whether it has already happened or not. Even if it happens, you
might not be aware of it given that :
* The other bookie replica might still have the entries
* You might not try to read these entries
[~zhaijia] The "compacted" entries are appended to the current entry log and,
before updating the index, we need to make sure the current entry log is
flushed. Otherwise, if a bookie crashes during compaction, we might have
indexes pointing to an invalid offset in the entry log.
In the current code, the current entryLogId is typically equal to the "least
unflushed entry log id". That means, given the current if condition, we are not
waiting for entries to be flushed.
I'll definitely add a unit test to trigger this behavior, and I still think
this should go into 4.4
> Compacted entries are not properly synced before updating index
> ---------------------------------------------------------------
>
> Key: BOOKKEEPER-926
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-926
> Project: Bookkeeper
> Issue Type: Bug
> Affects Versions: 4.3.1
> Reporter: Matteo Merli
> Assignee: Matteo Merli
> Fix For: 4.4.0
>
>
> I have identified a couple of issues in Bookie compaction code.
> h4. Compacted entries are not properly synced when index is updated
> When compacting, we read "active" entries from an entry log and we re-append
> to current entry log. After compacting a number of entries, by default 100K,
> or at the very end, we need to update the index pointing to the new entry log
> and offset.
> Before updating the index, we need to wait for this entries to be flushed and
> fsynced, otherwise a bookie crash might leave the index updated, pointing to
> an invalid offset.
> The current code that is supposed to wait until flushed is:
> {code:java}
> // GarbageCollectorThread.java:178
> EntryLocation lastOffset = offsets.get(offsets.size()-1);
> long lastOffsetLogId = EntryLogger.logIdForOffset(lastOffset.location);
> while (lastOffsetLogId < entryLogger.getLeastUnflushedLogId() && running) {
> synchronized (flushLock) {
> flushLock.wait(1000);
> }
> lastOffset = offsets.get(offsets.size()-1);
> lastOffsetLogId = EntryLogger.logIdForOffset(lastOffset.location);
> }
> // update the index
> {code}
> The condition {{lastOffsetLogId < entryLogger.getLeastUnflushedLogId()}} is
> wrong, because if the last compacted entry was written in an earlier entry
> log than the least unflushed log, it means that the entries are already
> flushed and thus we don't need to wait.
> In the normal case what happens is that {{lastOffsetLogId} is actually the
> current entryLog and it's equal to {{entryLogger.getLeastUnflushedLogId()}},
> so we don't wait. But, in this case the entries appended to the current
> entrylog are not flushed nor synced, hence the problem.
> h4. Exception during index flush
> Having an exception when updating the index, combined with the above issue,
> makes the bookie GC to stop indefinitely.
> What happens is that the offset list is not cleared, and at the next bookie
> GC iteration it will find the old compacted entries in that list, for which
> now the entryLogId is less than the current log id, and that makes the while
> loop to never exit.
> Another problem is that, any IOException during the index flush, will make
> the GC thread to bail out and it will not remove even the entry logs that
> were compacted and flushed. Next time, these entry logs will be compacted
> again.
> h4. Proposed solution
> I think the best solution is to trigger the {{entryLogger.flush()}} form the
> bookie GC thread before updating the indexes. That would simplify the code
> and I don't see any disadvantages in doing that.
> Another improvement would be to delete compacted entry logs individually, as
> soon as the compacted data is flush, without waiting the end of the whole
> compaction process.
> The advantages are :
> * If compaction stop halfway, at least we don't have to re-compact what we
> just compacted
> * Compaction won't require significant space overhead. Today a major
> compaction can end up reappending a large amount of data and then deleting
> all the entry logs at the very end, requiring twice the size of the active
> data set to be stored on disk at some point in time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)