On Wed, Oct 5, 2016 at 10:03 PM, Todd Lipcon <[email protected]> wrote:
> Did some more research on Ceph as well as what RocksDB is doing, and have
> another brainstorm idea:
>
> What if we added a file descriptor LRU cache to the File Block Manager?
> Frequently accessed data would (a) probably be in the block cache, or (b) at
> least be already open in the FD LRU cache. In the case that it is neither,
> maybe the cost of the open() syscall isn't so bad?
>
> One of the places where this might fail is the fact that we have to read a
> lot of file headers/footers when a tablet is opened. Typically these reads
> fall in the first (or last) few KB of the file. Given that, what about some
> kind of hybrid scheme:
>
> - We have a "fixed-size" allocator supporting only 8KB allocations. Thus
>   there is no fragmentation issue and reclaiming space is trivial.
> - For each block, we allocate 8KB which stores the first 4KB of data as
>   well as the last 4KB of data.
> - The "middle" of the file is stored as a normal file system file.
>
> At startup, we can easily do a big fadvise() to page in all the block
> footers and headers so that the initial load time is low (probably lower
> than today!). Then we rely on setting ulimit relatively high (e.g. 100k) and
> a pretty big LRU cache of fds so that warm blocks are kept open.
>
> One downside is that we still put pressure on the inode table of the file
> system, and operations like "rm -Rf /data" are super slow. That's not a
> huge deal, though, IMO.
>

You know my opinion on that :)

Another downside: wouldn't switching block managers have an impact on
migration?

>
> -Todd
>
> On Wed, Oct 5, 2016 at 5:08 PM, Todd Lipcon <[email protected]> wrote:
>
> > Thanks for posting this.
> >
> > It's worth taking a look at what some other systems have done as well. I
> > just spent some time looking at Ceph, and it sounds like they ran into
> > similar issues and moved to a raw-disk based idea:
> > http://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph-63311181
> >
> > I'll keep investigating how their actual allocator works. I don't think a
> > raw disk is usable for Kudu, but maybe some ideas would translate.
> >
> > -Todd
> >
> > On Wed, Oct 5, 2016 at 1:53 PM, Adar Dembo <[email protected]> wrote:
> >
> >> I've written up a doc that summarizes two major issues related to hole
> >> punching in Kudu's log block manager, as well as some approaches for
> >> fixing them. Please take a look if you're interested in the subject;
> >> your feedback is welcome.
> >>
> >> https://s.apache.org/uOOt
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
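
For reference, here is a rough sketch of the file descriptor LRU cache idea quoted above. It is only an illustration in Python, not Kudu code: the class name FdLruCache, its methods, and the `opens` counter are all made up for this example, and it assumes a POSIX system (it uses os.pread). A cache hit reuses an already open fd; a miss pays for one open() and, when the cache is full, closes the least recently used fd.

```python
import os
from collections import OrderedDict


class FdLruCache:
    """Hypothetical sketch: keeps at most `capacity` read-only fds open."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._fds = OrderedDict()  # path -> fd, least recently used first
        self.opens = 0             # number of open() calls issued (misses)

    def get(self, path):
        """Return an open fd for `path`, opening the file on a miss."""
        fd = self._fds.get(path)
        if fd is not None:
            # Hit: mark this entry as most recently used.
            self._fds.move_to_end(path)
            return fd
        # Miss: pay for the open() syscall and remember the fd.
        fd = os.open(path, os.O_RDONLY)
        self.opens += 1
        self._fds[path] = fd
        if len(self._fds) > self._capacity:
            # Over capacity: close the least recently used fd.
            _, victim = self._fds.popitem(last=False)
            os.close(victim)
        return fd

    def read(self, path, offset, length):
        """Positional read, so the shared fd's file offset is never moved."""
        return os.pread(self.get(path), length, offset)

    def close(self):
        """Close every cached fd."""
        while self._fds:
            _, fd = self._fds.popitem()
            os.close(fd)
```

This sketch is not thread-safe, and eviction can close an fd that another reader is still using; a real implementation would need locking and some form of reference counting on cached entries. The capacity would also have to stay below the process's open-file ulimit.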
