Did some more research on Ceph as well as what RocksDB is doing, and have
another brainstorm idea:

What if we added a file descriptor LRU cache to the File Block Manager?
Frequently accessed data would (a) probably be in the block cache, or (b) at
least be already open in the FD LRU cache. In the case that it is neither,
maybe the cost of the open() syscall isn't so bad?
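
To make the idea concrete, here is a minimal sketch of such an FD cache. It is written in Python for brevity and is not Kudu code; the class name FdLruCache, its methods, and the "opens" counter are all hypothetical names made up for illustration.

```python
import os
from collections import OrderedDict


class FdLruCache:
    """Keeps at most `capacity` files open; closes the least recently used."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._fds = OrderedDict()  # path -> fd, least recently used first
        self.opens = 0             # number of open() calls made (cache misses)

    def get(self, path):
        """Return an open read-only fd for `path`, opening it on a miss."""
        fd = self._fds.get(path)
        if fd is not None:
            self._fds.move_to_end(path)  # mark as most recently used
            return fd
        fd = os.open(path, os.O_RDONLY)
        self.opens += 1
        self._fds[path] = fd
        if len(self._fds) > self.capacity:
            _, victim = self._fds.popitem(last=False)  # evict the LRU entry
            os.close(victim)
        return fd

    def read(self, path, offset, length):
        """Positional read, so cached fds can be shared without seeking."""
        return os.pread(self.get(path), length, offset)

    def close(self):
        """Close every cached fd."""
        while self._fds:
            _, fd = self._fds.popitem()
            os.close(fd)
```

A warm block costs only a dictionary lookup plus a pread(); a cold block pays one extra open(), and possibly one close() for the evicted entry.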

One of the places where this might fail is the fact that we have to read a
lot of file headers/footers when a tablet is opened. Typically these reads
fall in the first (or last) few KB of the file. Given that, what about some
kind of hybrid scheme:

- We have a "fixed-size" allocator supporting only 8KB allocations. Thus
there is no fragmentation issue and reclaiming space is trivial.
- For each block, we allocate 8KB which stores the first 4KB of data as
well as the last 4KB of data.
- The "middle" of the file is stored as a normal file system file.
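
A rough sketch of that layout, in Python for brevity. The names SlotAllocator, split_block, and join_block are hypothetical, and the handling of blocks of 8KB or less (store them entirely in the slot, zero-padded) is an assumption that the scheme above does not spell out.

```python
SLOT_SIZE = 8 * 1024   # the only allocation size the allocator supports
HALF = SLOT_SIZE // 2  # 4KB of header data, 4KB of footer data


class SlotAllocator:
    """Hands out slot indexes; a slot lives at byte offset index * SLOT_SIZE."""

    def __init__(self):
        self._free = []   # indexes of reclaimed slots
        self._next = 0    # next never-used slot index

    def allocate(self):
        # Every slot is the same size, so any free slot fits any block:
        # no fragmentation, and reuse is a simple pop.
        if self._free:
            return self._free.pop()
        index = self._next
        self._next += 1
        return index

    def free(self, index):
        self._free.append(index)


def split_block(data):
    """Split a block into (slot_bytes, middle_bytes)."""
    if len(data) <= SLOT_SIZE:
        # Assumption: a small block fits entirely in its slot.
        return data.ljust(SLOT_SIZE, b"\0"), b""
    # First 4KB and last 4KB go in the slot; the rest goes in a normal file.
    return data[:HALF] + data[-HALF:], data[HALF:-HALF]


def join_block(slot, middle, length):
    """Reassemble a block of `length` bytes from its slot and middle file."""
    if length <= SLOT_SIZE:
        return slot[:length]
    return slot[:HALF] + middle + slot[HALF:]
```

Note that the block length has to be recorded somewhere (here it is passed to join_block) so that small, padded blocks can be told apart from large ones.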

At startup, we can easily do a big fadvise() to page in all the block
footers and headers so that the initial load time is low (probably lower
than today!). Then we rely on setting the open-file ulimit relatively high
(e.g. 100k) and a pretty big LRU cache of fds so that warm blocks are kept open.
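
The startup steps might look roughly like the following sketch (Python, Linux assumed). The function names prefetch_slots and raise_fd_limit are hypothetical, and it assumes all of the 8KB header/footer slots live in a single file so that one fadvise() call covers them.

```python
import os
import resource


def prefetch_slots(path):
    """Ask the kernel to start paging in the whole slot file; returns its size."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # POSIX_FADV_WILLNEED is only a hint; the kernel reads ahead
        # asynchronously and may ignore it under memory pressure.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)


def raise_fd_limit(target):
    """Raise the soft open-file limit toward `target`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft
```

The value returned by raise_fd_limit() would be a natural upper bound when sizing the FD cache, leaving headroom for sockets and other non-block files.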

One downside is that we still put pressure on the inode table of the file
system, and operations like "rm -Rf /data" are super slow. That's not a
huge deal, though, IMO.

-Todd

On Wed, Oct 5, 2016 at 5:08 PM, Todd Lipcon <[email protected]> wrote:

> Thanks for posting this.
>
> It's worth taking a look at what some other systems have done as well. I
> just spent some time looking at Ceph, and it sounds like they ran into similar
> issues and moved to a raw-disk based idea:
> http://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph-63311181
>
> I'll keep investigating how their actual allocator works. I don't think a
> raw disk is usable for Kudu, but maybe some ideas would translate.
>
> -Todd
>
> On Wed, Oct 5, 2016 at 1:53 PM, Adar Dembo <[email protected]> wrote:
>
>> I've written up a doc that summarizes two major issues related to hole
>> punching in Kudu's log block manager, as well as some approaches for
>> fixing them. Please take a look if you're interested in the subject;
>> your feedback is welcome.
>>
>> https://s.apache.org/uOOt
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
