hangc0276 commented on issue #2943:
URL: https://github.com/apache/bookkeeper/issues/2943#issuecomment-1086446251


   ### Motivation
   #### Ledger read/write logic
   When the BookKeeper server receives a write entry request, it writes the entry into the memory table, a bookie-level cache. When the memory table is full, its contents are sorted and flushed into the operating system PageCache, which buffers the data again. When a PageCache flush is triggered, the data is finally written to disk.
   
   When the BookKeeper server receives a read entry request, it first checks whether the memory table or the read cache contains the target entry. If both caches miss, it queries RocksDB for the entry's location in the entry log file and then reads the target entry from that file. After reading the entry, the read cache pre-reads more entries from the entry log files to keep the cache hit rate high for subsequent reads. From the operating system's perspective, when a log file is read at a specific position, the OS checks whether the target data is already in PageCache. On a hit, the data is returned directly from PageCache; otherwise, the OS reads a block of data from the log file and caches it in PageCache. That block may contain multiple entries located near the target read position.
   
   
   #### Drawbacks
   For ledger writes, this design limits write throughput for the following reasons.
   1. The memtable and the OS PageCache buffer the entry data twice, which consumes extra memory.
   2. The PageCache flush mechanism is controlled by the kernel and is hard for the application to tune, which matters a great deal for I/O-intensive applications.
   3. The number of kernel sync threads is limited by the number of disks, which is a poor fit for RAID arrays composed of multiple disks.
   
   For ledger reads, it also limits read throughput for the following reasons.
   1. When entry data is read from a log file, the OS prefetches data into PageCache. When many Pulsar topics fetch historical cold data, reads hit many log files at the same time and a large amount of data is prefetched into PageCache. Because of the entry log file's organization, which sorts and interleaves many ledgers into the same file, the prefetched data may not belong to the target ledger, wasting a lot of memory and reducing the PageCache hit rate.
   2. After the OS has prefetched a lot of data into PageCache, eviction is also a problem. The default PageCache eviction policy is LRU and cannot be controlled by the application without recompiling the kernel, so we cannot control which entries are evicted or when.
   
   
   ### Proposal
   Based on the above issues, we introduce optional support for bypassing the operating system PageCache on supported systems (currently Linux and macOS) by using the [open(2)](https://man7.org/linux/man-pages/man2/open.2.html) flag O_DIRECT. [fallocate(2)](https://man7.org/linux/man-pages/man2/fallocate.2.html) is used, where available, to ask the filesystem to allocate the required space before data is written.
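
   O_DIRECT requires that buffer addresses, file offsets, and transfer sizes be aligned to the block size (typically 4 KiB). As an illustration of the alignment constraint only (not the actual native-io code), the JVM itself can produce a suitably aligned direct buffer:

```java
import java.nio.ByteBuffer;

class AlignedBuffer {
    static final int ALIGNMENT = 4096; // typical block size for O_DIRECT

    // Over-allocate, then slice so the buffer's start address is 4K-aligned.
    static ByteBuffer allocateAligned(int size) {
        ByteBuffer raw = ByteBuffer.allocateDirect(size + ALIGNMENT);
        return raw.alignedSlice(ALIGNMENT); // Java 9+
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocateAligned(4096);
        // The slice starts on a 4K boundary and holds at least `size` bytes.
        System.out.println(buf.alignmentOffset(0, ALIGNMENT) == 0);
        System.out.println(buf.capacity() >= 4096);
    }
}
```

   A buffer obtained this way can be handed to a native pwrite-with-O_DIRECT call without violating the kernel's alignment rules.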
   
   The implementation uses JNI to perform direct I/O on files via POSIX syscalls. fallocate is used when running on Linux; otherwise it is skipped (at the cost of more filesystem operations during writing).
   
   There are two write calls, writeAt and writeDelimited. writeAt is expected to be used for the entry log headers, while entries go through writeDelimited. In both cases, the calls may return before the syscalls occur; #flush() must be called to ensure the data is actually written.
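
   A rough sketch of that write interface (the names writeAt, writeDelimited, and flush come from the description above; the exact signatures and the in-memory stand-in below are illustrative assumptions):

```java
import java.nio.ByteBuffer;

// Hypothetical shape of the writer interface; the real native writer
// issues pwrite(2)-style syscalls and may buffer until flush().
interface LogWriter {
    void writeAt(long offset, ByteBuffer data); // e.g. entry log header
    long writeDelimited(ByteBuffer data);       // length-prefixed entry append
    void flush();                               // force buffered writes out
}

// Toy in-memory implementation, only to show the calling convention.
class InMemoryLogWriter implements LogWriter {
    final byte[] store = new byte[1 << 20];
    int appendPos = 4096; // entries start after the header block

    public void writeAt(long offset, ByteBuffer data) {
        data.get(store, (int) offset, data.remaining());
    }

    public long writeDelimited(ByteBuffer data) {
        long start = appendPos;
        int len = data.remaining();
        ByteBuffer.wrap(store, appendPos, 4).putInt(len); // 4-byte size prefix
        appendPos += 4;
        data.get(store, appendPos, len);
        appendPos += len;
        return start;
    }

    public void flush() { /* no-op for the in-memory stand-in */ }
}
```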
   
   The entry log format is not much changed from the one used by the existing entry logger. The biggest difference is padding. Direct I/O must be performed in aligned blocks. The alignment size varies by machine configuration, but 4K is a safe bet on most. Since entry data is unlikely to end exactly on an alignment boundary, we add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte, the padding always parses to a negative value, which distinguishes it from valid entry data (the entry size is always positive) and from preallocated space (which is always 0).
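
   The alignment arithmetic can be sketched as follows (4096-byte alignment as stated above; the actual padding filler byte used by the PR may differ, here a value with the sign bit set is assumed so it parses as negative):

```java
class Padding {
    static final int ALIGNMENT = 4096;
    static final byte PADDING_BYTE = (byte) 0xF5; // hypothetical filler; high bit set

    // Round a write size up to the next alignment boundary.
    static long paddedSize(long size) {
        return (size + ALIGNMENT - 1) & ~(long) (ALIGNMENT - 1);
    }

    public static void main(String[] args) {
        System.out.println(paddedSize(1));    // 4096
        System.out.println(paddedSize(4096)); // 4096 (already aligned)
        System.out.println(paddedSize(4097)); // 8192
        // Read as a signed byte, the padding is negative: distinct from a
        // positive entry-size prefix and from zeroed preallocated space.
        System.out.println((int) PADDING_BYTE); // -11
    }
}
```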
   
   Another difference in the format is that the header is now 4K rather than 1K. Again, this allows aligned writes. No changes are needed for the existing entry logger to handle the header change, as we create a dummy entry in the extra header space, which the existing entry logger already knows to ignore.
   
   We have designed a write-buffer pool to hold written entries and flush them to disk when a buffer is full. For reads, each entry log file has its own reader, managed by a cache with an eviction policy. Each read uses a read buffer of a specific size to hold the data.
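
   A minimal sketch of such a write-buffer pool (the class name, buffer count, and sizes below are illustrative assumptions, not the actual BookKeeper classes):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

class WriteBufferPool {
    private final ArrayBlockingQueue<ByteBuffer> free;

    WriteBufferPool(int count, int bufferSize) {
        free = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) {
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Borrow a buffer for writing; blocks if every buffer is mid-flush.
    ByteBuffer acquire() throws InterruptedException {
        ByteBuffer b = free.take();
        b.clear();
        return b;
    }

    // Return a buffer once its contents have been flushed to disk.
    void release(ByteBuffer b) {
        free.offer(b);
    }
}
```

   Bounding the pool naturally applies back-pressure: when all buffers are waiting to be flushed, writers block in acquire() instead of growing memory without limit.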
   
   To enable this, set `dbStorage_directIOEntryLogger=true` in the configuration.
   
   ### Changes
   1. Add a bookkeeper-slogger module to provide support for structured logging with a pluggable logging backend, with an implementation using SLF4J.
   2. Add a native-io package to provide JNI bindings to the operating system's I/O APIs.
   3. Introduce an entry logger interface to support multiple entry logger implementations; currently there is a PageCache-based implementation and a direct I/O-based implementation.
   4. Add the direct I/O-based implementation DirectEntryLogger, enabled by the flag `dbStorage_directIOEntryLogger`.
   5. Refactor garbage collection and compaction to allow the entry logger to control which files are available for garbage collection.
   
   ### Implementation
   For parts 1, 2, 3, and 5, we will push individual PRs. For part 4, we plan to split the work into two PRs: one for writing and one for reading.
   
   ### Compatibility, Deprecation, and Migration Plan
   We only modified the read and write logic of the entry log file; we did not change its on-disk organization.
   
   So there are no compatibility concerns at this moment.
   
   
   ### Test Plan
   We will add tests for the following modules.
   1. bookkeeper-slogger
   2. native-io
   3. The direct I/O-based implementation DirectEntryLogger
   4. Garbage collection and compaction based on DirectEntryLogger
   
   ### Others
   I'm doing performance testing of the direct I/O-based implementation.
   
   

