Thanks for the feedback, everyone.

I think I've incorporated all of the feedback and added a ton more
background information. If you've ever wanted to know how this stuff works
but were afraid to ask, now's your chance!

I've also added a proposed solution and roadmap at the end.


On Thu, Oct 6, 2016 at 8:05 AM, Jean-Daniel Cryans <[email protected]> wrote:
> On Wed, Oct 5, 2016 at 10:03 PM, Todd Lipcon <[email protected]> wrote:
>
>> Did some more research on Ceph as well as what RocksDB is doing, and have
>> another brainstorm idea:
>>
>> What if we added a file descriptor LRU cache to the File Block Manager?
>> Frequently accessed data would (a) probably be in the block cache, or (b)
>> at least have its file already open in the FD LRU cache. And if it's in
>> neither, maybe the cost of the open() syscall isn't so bad?
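>>
>> To make that concrete, here's a rough single-threaded sketch of the fd
>> cache (FdLruCache and its methods are names I made up, not real Kudu
>> code; a real version would also need locking and would probably key on
>> block IDs rather than paths):
>>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <cstddef>
>> #include <list>
>> #include <string>
>> #include <unordered_map>
>> #include <utility>
>>
>> class FdLruCache {
>>  public:
>>   explicit FdLruCache(size_t capacity) : capacity_(capacity) {}
>>   ~FdLruCache() {
>>     for (auto& entry : lru_) close(entry.second);
>>   }
>>
>>   // Returns an open fd for 'path', opening the file on a miss and
>>   // evicting the least recently used fd when at capacity. -1 on error.
>>   int Get(const std::string& path) {
>>     auto it = index_.find(path);
>>     if (it != index_.end()) {
>>       // Hit: move the entry to the front of the LRU list.
>>       lru_.splice(lru_.begin(), lru_, it->second);
>>       return it->second->second;
>>     }
>>     int fd = open(path.c_str(), O_RDONLY);
>>     if (fd < 0) return -1;
>>     if (lru_.size() >= capacity_) {
>>       // Miss at capacity: close and drop the coldest fd.
>>       close(lru_.back().second);
>>       index_.erase(lru_.back().first);
>>       lru_.pop_back();
>>     }
>>     lru_.emplace_front(path, fd);
>>     index_[path] = lru_.begin();
>>     return fd;
>>   }
>>
>>  private:
>>   using Entry = std::pair<std::string, int>;
>>   const size_t capacity_;
>>   std::list<Entry> lru_;  // front == most recently used
>>   std::unordered_map<std::string, std::list<Entry>::iterator> index_;
>> };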
>>
>> One place where this might fall down: we have to read a lot of file
>> headers/footers when a tablet is opened. Typically these reads fall in the
>> first (or last) few KB of the file. Given that, what about some kind of
>> hybrid scheme:
>>
>> - We have a "fixed-size" allocator supporting only 8KB allocations. Thus
>> there is no fragmentation issue and reclaiming space is trivial.
>> - For each block, we allocate 8KB which stores the first 4KB of data as
>> well as the last 4KB of data.
>> - The "middle" of the file is stored as a normal file system file.
>>
>> At startup, we can easily do a big fadvise() to page in all the block
>> headers and footers so that the initial load time is low (probably lower
>> than today!). Then we rely on a relatively high ulimit (e.g. 100k) and a
>> pretty big LRU cache of fds so that warm blocks stay open.
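>>
>> As a sketch of that warm-up step (WarmSlabFile is a made-up name, and
>> it assumes the 8KB slabs live together in one big container file,
>> which I'm assuming for illustration, not something decided above):
>>
>> #include <fcntl.h>
>> #include <unistd.h>
>>
>> // Ask the kernel to start paging in the whole slab file so that block
>> // headers/footers are warm before tablets begin opening.
>> bool WarmSlabFile(const char* path) {
>>   int fd = open(path, O_RDONLY);
>>   if (fd < 0) return false;
>>   // offset == 0 and len == 0 mean "the entire file".
>>   int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
>>   close(fd);
>>   return err == 0;
>> }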
>>
>> One downside is that we still put pressure on the inode table of the file
>> system, and operations like "rm -Rf /data" are super slow. That's not a
>> huge deal, though, IMO.
>>
>
> You know my opinion on that :)
>
> Another downside is that switching block managers will have an impact on
> migration, no?
>
>
>>
>> -Todd
>>
>> On Wed, Oct 5, 2016 at 5:08 PM, Todd Lipcon <[email protected]> wrote:
>>
>> > Thanks for posting this.
>> >
>> > It's worth taking a look at what some other systems have done as well. I
>> > just spent some time looking at Ceph, and it sounds like they ran into
>> > similar issues and moved to a raw-disk based design:
>> > http://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph-63311181
>> >
>> > I'll keep investigating how their actual allocator works. I don't think
>> > a raw disk is usable for Kudu, but maybe some ideas would translate.
>> >
>> > -Todd
>> >
>> > On Wed, Oct 5, 2016 at 1:53 PM, Adar Dembo <[email protected]> wrote:
>> >
>> >> I've written up a doc that summarizes two major issues related to hole
>> >> punching in Kudu's log block manager, as well as some approaches for
>> >> fixing them. Please take a look if you're interested in the subject;
>> >> your feedback is welcome.
>> >>
>> >> https://s.apache.org/uOOt
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
