Thanks for the feedback, everyone. I think I've incorporated all of it, and I've added a ton more background information. If you've wanted to know how this stuff works but were afraid to ask, now's your chance!
I've also added a proposed solution and roadmap at the end.

On Thu, Oct 6, 2016 at 8:05 AM, Jean-Daniel Cryans <[email protected]> wrote:
> On Wed, Oct 5, 2016 at 10:03 PM, Todd Lipcon <[email protected]> wrote:
>
>> Did some more research on Ceph as well as what RocksDB is doing, and
>> have another brainstorm idea:
>>
>> What if we added a file descriptor LRU cache to the file block manager?
>> Frequently accessed data would (a) probably be in the block cache, or
>> (b) at least already be open in the FD LRU cache. In the case that it is
>> neither, maybe the cost of the open() syscall isn't so bad?
>>
>> One of the places where this might fail is the fact that we have to
>> read a lot of file headers/footers when a tablet is opened. Typically
>> these reads fall in the first (or last) few KB of the file. Given that,
>> what about some kind of hybrid scheme:
>>
>> - We have a "fixed-size" allocator supporting only 8 KB allocations.
>>   Thus there is no fragmentation issue and reclaiming space is trivial.
>> - For each block, we allocate 8 KB which stores the first 4 KB of data
>>   as well as the last 4 KB of data.
>> - The "middle" of the block is stored as a normal file system file.
>>
>> At startup, we can easily do a big fadvise() to page in all the block
>> headers and footers so that the initial load time is low (probably
>> lower than today!). Then we rely on setting the ulimit relatively high
>> (e.g. 100k) and a pretty big LRU cache of fds so that warm blocks are
>> kept open.
>>
>> One downside is that we still put pressure on the file system's inode
>> table, and operations like "rm -Rf /data" are super slow. That's not a
>> huge deal, though, IMO.
>
> You know my opinion on that :)
>
> Another downside is that switching block managers will have an impact
> on migration?
>
>> -Todd
>>
>> On Wed, Oct 5, 2016 at 5:08 PM, Todd Lipcon <[email protected]> wrote:
>>
>> > Thanks for posting this.
>> >
>> > It's worth taking a look at what some other systems have done as
>> > well. I just spent some time looking at Ceph, and it sounds like they
>> > ran into similar issues and moved to a raw-disk-based design:
>> > http://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph-63311181
>> >
>> > I'll keep investigating how their actual allocator works. I don't
>> > think a raw disk is usable for Kudu, but maybe some ideas would
>> > translate.
>> >
>> > -Todd
>> >
>> > On Wed, Oct 5, 2016 at 1:53 PM, Adar Dembo <[email protected]> wrote:
>> >
>> >> I've written up a doc that summarizes two major issues related to
>> >> hole punching in Kudu's log block manager, as well as some
>> >> approaches for fixing them. Please take a look if you're interested
>> >> in the subject; your feedback is welcome.
>> >>
>> >> https://s.apache.org/uOOt
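
For anyone who wants to see Todd's FD cache idea in code, here's a minimal
sketch of a file descriptor LRU cache. All names (FdCache, etc.) are made up
for illustration and aren't Kudu's actual API:

#include <fcntl.h>
#include <unistd.h>

#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

class FdCache {
 public:
  explicit FdCache(size_t capacity) : capacity_(capacity) {}

  ~FdCache() {
    for (const auto& entry : lru_) close(entry.second);
  }

  // Returns an open fd for 'path', opening it on a miss. Returns -1 on error.
  int Get(const std::string& path) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = index_.find(path);
    if (it != index_.end()) {
      // Hit: move the entry to the front of the LRU list.
      lru_.splice(lru_.begin(), lru_, it->second);
      return it->second->second;
    }
    // Miss: pay the open() syscall, evicting the coldest fd if we're full.
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return -1;
    if (lru_.size() >= capacity_) {
      close(lru_.back().second);
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
    lru_.emplace_front(path, fd);
    index_[path] = lru_.begin();
    return fd;
  }

 private:
  using Entry = std::pair<std::string, int>;  // (path, fd)
  const size_t capacity_;
  std::mutex lock_;
  std::list<Entry> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};

Note that Get() hands back a raw fd that another thread could evict and
close; a real implementation would want refcounted handles, omitted here to
keep the sketch short.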
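
The hybrid 8 KB scheme implies a read path that stitches together three
sources: the slab's head, the slab's tail, and the backing file for the
middle. A rough sketch, again with hypothetical names, assuming blocks are
larger than 8 KB (smaller blocks would live entirely in the slab) and that
callers never read past the end of a block:

#include <unistd.h>

#include <algorithm>
#include <cstdint>
#include <cstring>

static constexpr uint64_t kEdgeSize = 4096;  // bytes cached at each end

struct HybridBlock {
  uint64_t size;         // total block size in bytes; assume > 2 * kEdgeSize
  const uint8_t* slab;   // 8 KB slab: [0, 4K) = head, [4K, 8K) = tail
  int middle_fd;         // regular file holding bytes [4K, size - 4K)
};

// Copies block bytes [off, off + len) into 'dst'. The head and tail come
// straight from the slab; only the middle touches the file system. Error
// handling is elided.
void Read(const HybridBlock& b, uint64_t off, uint64_t len, uint8_t* dst) {
  const uint64_t tail_start = b.size - kEdgeSize;
  while (len > 0) {
    uint64_t n;
    if (off < kEdgeSize) {
      // Head: first 4 KB, served from the slab.
      n = std::min(len, kEdgeSize - off);
      memcpy(dst, b.slab + off, n);
    } else if (off >= tail_start) {
      // Tail: last 4 KB, served from the slab's second half.
      n = len;
      memcpy(dst, b.slab + kEdgeSize + (off - tail_start), n);
    } else {
      // Middle: read from the backing file, whose offset 0 corresponds
      // to block offset kEdgeSize.
      n = std::min(len, tail_start - off);
      pread(b.middle_fd, dst, n, static_cast<off_t>(off - kEdgeSize));
    }
    off += n;
    dst += n;
    len -= n;
  }
}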
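
And the startup prefetch Todd mentions could be as simple as a
POSIX_FADV_WILLNEED hint over each file's first and last few KB. The exact
ranges depend on which scheme we pick; this sketch assumes the edges of a
regular file:

#include <fcntl.h>

// Hint the kernel to asynchronously page in a file's header and footer.
// POSIX_FADV_WILLNEED is advisory only; the kernel may ignore it.
void PrefetchEdges(int fd, off_t file_size) {
  const off_t kEdge = 4096;
  posix_fadvise(fd, 0, kEdge, POSIX_FADV_WILLNEED);
  if (file_size > kEdge) {
    posix_fadvise(fd, file_size - kEdge, kEdge, POSIX_FADV_WILLNEED);
  }
}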
