On Wed, Oct 5, 2016 at 10:03 PM, Todd Lipcon <[email protected]> wrote:

> Did some more research on Ceph as well as what RocksDB is doing, and have
> another brainstorm idea:
>
> What if we added a file descriptor LRU cache to the File Block Manager?
> Frequently accessed data would (a) probably be in the block cache, or (b) at
> least be already open in the FD LRU cache. In the case that it is neither,
> maybe the cost of the open() syscall isn't so bad?
>
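
To make the FD LRU cache idea above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name FdLruCache, its methods, and the "opens" counter are hypothetical, and none of it is taken from Kudu's File Block Manager. A hit reuses an already-open descriptor; a miss pays for one open() and may evict the least recently used descriptor.

```python
import os
from collections import OrderedDict


class FdLruCache:
    """Hypothetical LRU cache of read-only file descriptors, keyed by path."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._fds = OrderedDict()  # path -> fd, least recently used first
        self.opens = 0  # count of open() calls, kept only for illustration

    def get(self, path):
        """Return an open fd for path, opening the file on a cache miss."""
        fd = self._fds.get(path)
        if fd is not None:
            self._fds.move_to_end(path)  # mark as most recently used
            return fd
        fd = os.open(path, os.O_RDONLY)
        self.opens += 1
        self._fds[path] = fd
        if len(self._fds) > self.capacity:
            # Evict the least recently used descriptor and close it.
            _, victim = self._fds.popitem(last=False)
            os.close(victim)
        return fd

    def read(self, path, offset, length):
        """Positional read, so cached descriptors share no seek state."""
        return os.pread(self.get(path), length, offset)

    def close(self):
        """Close every cached descriptor."""
        while self._fds:
            _, fd = self._fds.popitem()
            os.close(fd)
```

In this sketch the capacity would be sized against the process's open-file limit; a real implementation would also need locking and a way to keep a descriptor from being closed while a read is in flight.
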
> One of the places where this might fail is the fact that we have to read a
> lot of file headers/footers when a tablet is opened. Typically these reads
> fall in the first (or last) few KB of the file. Given that, what about some
> kind of hybrid scheme:
>
> - We have a "fixed-size" allocator supporting only 8KB allocations. Thus
> there is no fragmentation issue and reclaiming space is trivial.
> - For each block, we allocate 8KB which stores the first 4KB of data as
> well as the last 4KB of data.
> - The "middle" of the file is stored as a normal file system file.
>
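
The hybrid scheme in the list above could be sketched roughly as follows. All names here (FixedSlotAllocator, split_block, SLOT_SIZE) are hypothetical, and the handling of blocks smaller than 8KB is my own assumption, since the proposal does not say what happens to them.

```python
SLOT_SIZE = 8 * 1024    # every allocation is exactly 8KB
HALF = SLOT_SIZE // 2   # 4KB of header data plus 4KB of footer data


class FixedSlotAllocator:
    """Hypothetical allocator handing out fixed-size 8KB slots by index."""

    def __init__(self):
        self._free = []      # indices of reclaimed slots
        self._next_slot = 0  # first never-used slot index

    def allocate(self):
        """Reuse a freed slot if one exists, otherwise extend the area."""
        if self._free:
            return self._free.pop()
        slot = self._next_slot
        self._next_slot += 1
        return slot

    def free(self, slot):
        """Reclaiming is trivial: every slot has the same size."""
        self._free.append(slot)

    @staticmethod
    def offset(slot):
        """Byte offset of a slot inside the fixed-size area."""
        return slot * SLOT_SIZE


def split_block(data):
    """Split a block into (head, middle, tail).

    head is the first 4KB and tail the last 4KB; both would live in the
    8KB slot. middle is whatever remains and would go to a normal file.
    Assumption: for blocks under 8KB, the tail is whatever follows the
    head and the middle is empty.
    """
    head = data[:HALF]
    rest = data[HALF:]
    tail_len = min(HALF, len(rest))
    middle = rest[:len(rest) - tail_len]
    tail = rest[len(rest) - tail_len:]
    return head, middle, tail
```

Because every slot is the same size, a freed slot can be handed to any later block, which is why the sketch needs nothing more than a free list.
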
> At startup, we can easily do a big fadvise() to page in all the block
> footers and headers so that the initial load time is low (probably lower
> than today!). Then we rely on setting ulimit relatively high (e.g. 100k) and
> a pretty big LRU cache of fds so that warm blocks are kept open.
>
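
For the startup step above, a rough sketch of the "big fadvise()" might look like this. It assumes the 8KB header/footer allocations all live in a single file, which the proposal implies but does not spell out, and the function name prefetch_slot_file is hypothetical. os.posix_fadvise is only available on some Unix platforms, so the sketch reports whether the hint was actually issued.

```python
import os


def prefetch_slot_file(path):
    """Hint the kernel to page in the whole header/footer file.

    Returns True if the advice was issued, False if this platform's
    Python build does not expose posix_fadvise.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 with length=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True
```

The call is only a hint: the kernel starts readahead asynchronously, so later reads of headers and footers are likely, not guaranteed, to hit the page cache.
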
> One downside is that we still put pressure on the inode table of the file
> system, and operations like "rm -Rf /data" are super slow. That's not a
> huge deal, though, IMO.
>

You know my opinion on that :)

Another downside: won't switching block managers have an impact on
migration?


>
> -Todd
>
> On Wed, Oct 5, 2016 at 5:08 PM, Todd Lipcon <[email protected]> wrote:
>
> > Thanks for posting this.
> >
> > It's worth taking a look at what some other systems have done as well. I
> > just spent some time looking at Ceph, and sounds like they ran into similar
> > issues and moved to a raw-disk based idea:
> > http://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph-63311181
> >
> > I'll keep investigating how their actual allocator works. I don't think a
> > raw disk is usable for Kudu, but maybe some ideas would translate.
> >
> > -Todd
> >
> > On Wed, Oct 5, 2016 at 1:53 PM, Adar Dembo <[email protected]> wrote:
> >
> >> I've written up a doc that summarizes two major issues related to hole
> >> punching in Kudu's log block manager, as well as some approaches for
> >> fixing them. Please take a look if you're interested in the subject;
> >> your feedback is welcome.
> >>
> >> https://s.apache.org/uOOt
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
