Todd Lipcon has posted comments on this change.

Change subject: KUDU-2693. Buffer blocks while flushing rowsets

Patch Set 1:

Commit Message:
PS1, Line 13: This means that, instead of each flush
            : requiring its own LBM container, many of the output blocks can be
            : consolidated into a small number of new containers (typically one 
            : disk).
> While certainly an improvement over the random block-to-container assignmen
Yeah, the skepticism there is that, even if we put the blocks for a column in 
one container, we might not be flushing in exactly key order, so doing so still 
doesn't allow you to just do a big sequential read of that column.

The "row-oriented"-ness here is sorta half true. What we end up with is 
typically referred to as "PAX" -- essentially you divide into blocks of rows, 
and within the block, it's columnar. Parquet calls these blocks "row groups". 
We call them "DRS". Whether we lay out the column chunks in one container or 
across a bunch isn't really too different, right? (Parquet always puts them in 
a single file)
PS1, Line 27: up to 2x overhead from buffer size-doubling
> Not following this bit; there's just one buffer per writable block, so wher
Just because append()ing 10 bytes to a full 32KB buffer will produce a 64KB 
buffer. In the limit, a faststring may have a capacity of twice its current 
length. I'll try to clarify.

We could improve this max overhead by using an iovec-like chain of smaller 
buffers of course but I was lazy so I went with the quickest prototype for a 
PS1, Line 42:    "data dirs.queue_time_us" : 1847,
            :    "data dirs.run_cpu_time_us" : 409,
            :    "data dirs.run_wall_time_us" : 21185,
> Were these edited out of "with the change"? If so, could drop them from her
Hmm, I actually don't recall whether I edited them out or they just weren't there.
PS1, Line 66:    "lbm_writes_lt_1ms" : 1419
> This is almost an order of magnitude less than before the change. Any thoug
The total number of writes is lower because we now do only one write when we 
flush the buffer, whereas before each call to block->Append() would do a write 
directly to disk.
PS1, Line 73: On a spinning disk the reduction from 450
            : fsyncs to 5 fsyncs is likely to be even more impactful in terms 
            : of wall clock.
> Yeah I'd love to see some numbers on a spinning disk.
I think Grant was going to try this out on a real workload on spinning disks?


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacc662ba812ece8e68b0ef28f4ccdf0b7475fbc0
Gerrit-Change-Number: 12425
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <>
Gerrit-Reviewer: Adar Dembo <>
Gerrit-Reviewer: Andrew Wong <>
Gerrit-Reviewer: Andrew Wong <>
Gerrit-Reviewer: Grant Henke <>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <>
Gerrit-Reviewer: Todd Lipcon <>
Gerrit-Comment-Date: Wed, 13 Feb 2019 07:15:28 +0000
Gerrit-HasComments: Yes
