Adar Dembo has posted comments on this change.

Change subject: Allow for reserving disk space for non-Kudu processes
......................................................................


Patch Set 4:

(16 comments)

I reviewed everything but the LBM changes. I've indicated enough high-level 
issues that I think we should discuss them first, in case they require some 
reimplementation.

http://gerrit.cloudera.org:8080/#/c/3135/4//COMMIT_MSG
Commit Message:

Line 13: reservation limit then a crash will result.
Why did you choose to make the behavior a crash instead of just returning an 
error up the stack? Granted, I doubt we can handle these errors in a meaningful 
way during a flush (if we can't write to disk, we'll run out of memory 
eventually), but we certainly can on a WAL write or a compaction.

In general I think "leaf" code should return errors and "non-leaf" code should, 
if necessary, crash. That way when we do crash we have more context: not only 
do we know we ran out of disk space, but we know we were in the middle of e.g. 
a flush.
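
A minimal sketch of the split I have in mind (names hypothetical):

  // Leaf: report the problem and let the caller decide what to do.
  Status CheckSufficientDiskSpace(Env* env, const std::string& path,
                                  int64_t requested_bytes);

  // Non-leaf: if it must crash, crash with context.
  void Tablet::FlushInternal() {
    CHECK_OK_PREPEND(CheckSufficientDiskSpace(env_, data_dir_, needed_bytes),
                     "ran out of disk space during flush");
    // ...
  }

That way the fatal log message tells us both what happened and what we were 
doing at the time.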


http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/integration-tests/disk_reservation-itest.cc
File src/kudu/integration-tests/disk_reservation-itest.cc:

Line 55: // Note: This functionality is not implemented in the file block manager, so
       : // this test is disabled on non-Linux platforms.
Can't we compile the code on all platforms but, at runtime, test the kind of 
block manager we're using, and no-op the test if it's not the LBM?
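
Something like this at the top of the test body would do it (assuming we 
DECLARE_string(block_manager) so the test can read the existing gflag):

  if (FLAGS_block_manager != "log") {
    LOG(INFO) << "Test only applies to the log block manager; skipping";
    return;
  }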


Line 67:   workload.set_num_write_threads(4);
       :   workload.set_timeout_allowed(true);
       :   workload.set_write_timeout_millis(500);
How did you arrive at these non-default values? If they're important, they're 
worth comments.

num_replicas(1) I understand (you've only got one TS).
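
Something like:

  workload.set_num_write_threads(4);       // Several writers to fill the disk faster.
  workload.set_timeout_allowed(true);      // Writes are expected to start failing as the disk fills up.
  workload.set_write_timeout_millis(500);  // Fail fast so the test doesn't hang on a full disk.

(Those are my guesses at the rationale; if they're wrong, all the more reason 
to write the real reasons down.)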


Line 73:   // Write a few rows to get some rows flushing.
       :   while (workload.rows_inserted() < 10) {
       :     SleepFor(MonoDelta::FromMilliseconds(10));
       :   }
Why do we need to do this if we're also looping and waiting for containers 
below?


Line 80:   // Wait until we have 2 active containers on TS1.
Nit: TS1 is the only TS; why refer to it as TS1? Below too.


Line 95:                                          1L * 1024 * 1024 * 1024)));
1GB seems unnecessarily high. How about 128 MB, or even lower? How long does 
this test take to run on a spinning disk? On an SSD?


Line 126:   ts_flags.push_back("--enable_maintenance_manager=false"); // No flushing for this test.
Why is this important?


Line 132:   workload.set_timeout_allowed(true);
        :   workload.set_write_timeout_millis(500);
        :   workload.set_num_write_threads(8);
        :   workload.set_write_batch_size(1024);
As above, please justify the non-default values with comments.


http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env.h
File src/kudu/util/env.h:

Line 212:   // Returns information about a mounted filesystem.
No unit test in env-test for StatVfs?
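
Even a simple smoke test would help, e.g. (assuming the usual KuduTest fixture 
with env_ and test_dir_):

  TEST_F(TestEnv, TestStatVfs) {
    struct statvfs buf;
    ASSERT_OK(env_->StatVfs(test_dir_, &buf));
    ASSERT_GT(buf.f_frsize, 0);            // Fragment size should be non-zero.
    ASSERT_GE(buf.f_blocks, buf.f_bavail); // Can't have more free blocks than total blocks.
  }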


Line 214:   virtual Status StatVfs(const std::string& path, struct statvfs* buf) = 0;
Not a fan of leaking a POSIX-specific struct into the rest of Kudu. AFAICT this 
is a first.

Could you instead adapt what you need from statvfs into a new Kudu struct or 
class?
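
Off the top of my head, something like this (names hypothetical):

  // Space statistics for a filesystem, in bytes.
  struct SpaceInfo {
    int64_t capacity_bytes; // Total size of the filesystem.
    int64_t free_bytes;     // Bytes available to non-privileged processes.
  };

  virtual Status GetSpaceInfo(const std::string& path, SpaceInfo* info) = 0;

Then env_posix.cc is the only place that needs to know about statvfs.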


http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env_posix.cc
File src/kudu/util/env_posix.cc:

Line 993:   virtual Status StatVfs(const std::string& path, struct statvfs* buf) OVERRIDE {
Need to add ThreadRestrictions::AssertIOAllowed(). Also should add a 
TRACE_EVENT() call.
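
That is, at the top of the method, matching the other PosixEnv calls:

  TRACE_EVENT1("io", "PosixEnv::StatVfs", "path", path);
  ThreadRestrictions::AssertIOAllowed();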


Line 997:       return Status::IOError("statvfs: " + path, ErrnoToString(errno), errno);
You should use the IOError() function so that FLAGS_suicide_on_eio will be 
honored.
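
That is (assuming the static IOError() helper defined earlier in this file):

  return IOError("statvfs: " + path, errno);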


http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env_util.cc
File src/kudu/util/env_util.cc:

Line 37: DEFINE_int64(disk_reserved_bytes, 0,
       :              "Number of bytes to reserve on each filesystem for non-Kudu usage");
I don't know whether this will be sufficient. I'm thinking we may want to 
configure this separately for WALs and for data blocks. The rationale being: 
WALs live on one device and that's typically a fast, non-spinning device (like 
an SSD or NVRAM). Said device is often smaller and may be used for the OS too, 
so you'd want to handle reservations differently for it. I'm guessing HDFS 
doesn't work this way because all it stores are blocks, and so its data 
directories are uniform more often than not.

This means the reservations should be handled separately in the block managers 
and in the WAL code, and then there's no real use for this centralized code.
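
If we go that way, I'd expect something like (flag names hypothetical):

  DEFINE_int64(fs_wal_dir_reserved_bytes, 0,
               "Number of bytes to reserve on the WAL directory's filesystem "
               "for non-Kudu usage");
  DEFINE_int64(fs_data_dirs_reserved_bytes, 0,
               "Number of bytes to reserve on each data directory's filesystem "
               "for non-Kudu usage");

with the former consulted by the WAL code and the latter by the block managers.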


Line 43: "Number of inodes to reserve on each filesystem for non-Kudu usage"
Why would an operator care to reserve inodes? Do other systems do this? What's 
their rationale?


Line 125:   if (PREDICT_FALSE(FLAGS_disk_reserved_bytes_free_for_testing > -1)) {
If these for_testing flags are set, why bother with the StatVfs() call at all?
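
I'd expect the testing path to short-circuit the syscall entirely, e.g.:

  int64_t free_bytes;
  if (PREDICT_FALSE(FLAGS_disk_reserved_bytes_free_for_testing > -1)) {
    free_bytes = FLAGS_disk_reserved_bytes_free_for_testing;
  } else {
    struct statvfs buf;
    RETURN_NOT_OK(env->StatVfs(path, &buf));
    free_bytes = static_cast<int64_t>(buf.f_bavail) * buf.f_frsize;
  }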


http://gerrit.cloudera.org:8080/#/c/3135/4/src/kudu/util/env_util.h
File src/kudu/util/env_util.h:

Line 45: Status AssertSufficientDiskSpace(Env *env, const std::string& path, int64_t bytes);
Nit: maybe replace "Assert" with "Check" or "Verify"? When I see assert I think 
of a gtest assert.


-- 
To view, visit http://gerrit.cloudera.org:8080/3135
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd0451d4dbddc1783019a53302de0263080939c7
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes
