[
https://issues.apache.org/jira/browse/BOOKKEEPER-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487659#comment-13487659
]
Ivan Kelly commented on BOOKKEEPER-432:
---------------------------------------
Thanks for clarifying Yixue. It would be good if the doc separated the btree
and skiplist approaches completely, but I this I understand the overall jist.
Both the designs seem to be motivated by quick access to individual entries. I
think this motivation isn't fully correct. Really what we need is quick access
to a suffix of entries.
Lets say a flush takes 60 seconds, for 1000 ledgers, 1KB entries, writing at
50k per second. For an individual ledger, there will be ((60*50k)/1000) = 3000
entries, or roughly 3MB of data. The offset of the block has to written to the
index. A 7.2k rpm disk can read sequentially at 100MB/s, and can do about 120
seeks a second. We need one seek to get to the block in any case. Now, if want
to get to the precise entry, we need to seek again. Or we can just read the
whole block. Which is quicker? I would say they're roughly the same, but one
requires much more complicated data structures.
Now, if a flush takes longer, the tradeoff changes, and I have seen flushes
taking longer before, but if we change the index from storing the every entry,
to just the start of a sequence of entries, flushes should get faster. And even
then, if we want to add skips, we should add extra entries in the index rather
than in the entrylog, as that is why we have an index. Having an auxiliary
index negates the need for complex data structures in the entrylog itself.
> Improve performance of entry log range read per ledger entries
> ---------------------------------------------------------------
>
> Key: BOOKKEEPER-432
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-432
> Project: Bookkeeper
> Issue Type: Improvement
> Components: bookkeeper-server
> Affects Versions: 4.2.0
> Environment: Linux
> Reporter: Yixue (Andrew) Zhu
> Assignee: Yixue (Andrew) Zhu
> Labels: patch
> Attachments: BookieLedgerStorageProposal.pdf
>
>
> We observed random I/O reads when some subscribers fall behind (on some
> topics), as delivery needs to scan the entry logs (thru ledger index), which
> are interleaved with ledger entries across all ledgers being served.
> Essentially, the ledger index is a non-clustered index. It is not effective
> when a large number of ledger entries need to be served, which tend to be
> scattered around due to interleaving.
> Some possible improvements:
> 1. Change the ledger entries buffer to use a SkipList (or other suitable),
> sorted on (ledger, entry sequence). When the buffer is flushed, the entry log
> is written out in the already-sorted order.
> The "active" ledger index can point to the entries buffer (SkipList), and
> fixed up with entry-log position once latter is persisted.
> Or, the ledger index can be just rebuilt on demand. The entry log file tail
> can have index attached (light-weight b-tree, similar with big-table). We
> need to track per ledger which log files contribute entries to it, so that
> in-memory index can be rebuilt from the tails of corresponding log files.
> 2. Use affinity concept to make ensembles of ledgers (belonging to same
> topic) as identical as possible. This will help above 1. be more effective.
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira