Adar Dembo has posted comments on this change.

Change subject: fs: generate report during Open
......................................................................


Patch Set 4:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/6581/4//COMMIT_MSG
Commit Message:

PS4, Line 9: This patch modifies the FsManager and block managers to produce a 
filesystem
           : report when opened. The report includes various filesystem 
statistics as
           : well as all inconsistencies found and repaired.
           : 
           : The heart of the patch is the FsReport nested data structure. 
Originally
           : implemented as a protobuf to take advantage of DebugString(), I 
found it to
           : be a poor fit once I began customizing the log output, so I 
rewrote it in
           : its current form. The report includes fs-wide statistics as well 
as optional
           : consistency checks that may or may not be performed by the block 
managers.
           : Reports can be merged; the LBM takes advantage of this as it 
processes data
           : directories in parallel. The consistency checks have structure, so 
as to
           : simplify testing and ToString() customization.
           : 
           : Thanks to the level of detail in the FsReport, the LBM now 
separates the act
           : of finding inconsistencies from repairing them. This makes it much 
easier to
           : enforce that repairs aren't done on a read-only filesystem, or if 
there were
           : any fatal and irreparable errors.
> could you post an example of the report on the commit message?
Done


http://gerrit.cloudera.org:8080/#/c/6581/5/src/kudu/fs/fs_report.h
File src/kudu/fs/fs_report.h:

Line 30: 
> More generalized documentation of Check struct usage would be nice.
Done


http://gerrit.cloudera.org:8080/#/c/6581/4/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

PS4, Line 69:     // unnecessarily.
> nit no need to wrap
Generally I wrap at 80 characters, and this extends to 81 if unwrapped. But, I 
agree that a single word hanging out on its own line is distracting, so I'll 
unwrap.


http://gerrit.cloudera.org:8080/#/c/6581/4/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS4, Line 305: preallocated_window()
> docs
Gonna punt; a subsequent patch removes the method.


PS4, Line 324: truly catastrophic
> replace with the verbage you've been using elsewhere (i guess fatal and irr
This isn't for fatal and irreparable inconsistencies though; this is for "we 
have to stop NOW" errors. There's just one such case right now: not being able 
to get the size of the container's data file. But that's why I called it "truly 
catastrophic".


PS4, Line 554: &data_file_size, &is_malformed_record));
> nit no need to wrap
See earlier comment; I wrap at 80 characters.


PS4, Line 571: pb_reader.offset());
> same
See earlier.


PS4, Line 584: wildly
> what's "wildly incorrect"?
Non-existent or negative values (what we're testing for in the subsequent 
condition).


PS4, Line 1652: We are going to perform all of these checks
> are these all the possible checks? if so please say so, if not explain why
These aren't all of the checks. We don't emplace the missing/orphaned block 
checks, because the LBM can't do them on its own.

I'll add a comment to that effect but I prefer to keep this somewhat vague so 
that it won't get stale as the set of possible checks changes.


PS4, Line 1726: Truncation is performed even if the container's logical file 
size
              :       // (available by proxy via preallocated_offset_) and 
total_bytes_written_
              :       // agree, because XFS's speculative preallocation feature 
may artificially
              :       // enlarge the file without updating its file size.
> mention that this is XFS specific earlier in the comment
Why? It's just one sentence and you can't read it without seeing "XFS 
speculative preallocation" in there.


PS4, Line 1753: // TODO(adar): track as an inconsistency?
> yeah. can we do that as part of this patch? how "possible" is this? should 
I hesitated to track it as an inconsistency for it for a couple reasons:
1. It's scoped to multiple containers (other inconsistencies are all single 
container), so a little more complicated.
2. I can't think of a "natural" way it'd occur in the wild. The only 
possibilities involve operator error (i.e. duplicating a container).

We can always convert it into an inconsistency in the future.


PS4, Line 1821: report->malformed_record_check->entries.emplace_back(
              :             container->ToString(), record);
> could we provide more information as to what exactly the malformedness was?
If we wanted that, I'd probably track it as a different kind of inconsistency 
altogether (instead of malformed record).

If you feel strongly, I'll do that now; otherwise I was planning to abide by 
the TODO and leave it...to do.


PS4, Line 1841: if (read_only_ || report->HasFatalErrors()) {
              :     return Status::OK();
              :   }
> LOG(WARNING) here
Done


PS4, Line 1878: / Technically we've "repaired" the inconsistency if the 
truncation
              :       // succeeded, even if the following logic fails.
              :       pr.repaired = true;
> did we? even if we never ended up syncing to disk? (I'm assuming that ftrun
You're correct that ftruncate() won't fsync(), but the only  circumstance in 
which an fsync() won't eventually happen by virtue of the fs' periodic sync is 
one where the machine crashes. In that circumstance, the next time the LBM 
starts up it'll detect the inconsistency again, repair it again, etc.


http://gerrit.cloudera.org:8080/#/c/6581/4/src/kudu/tools/tool_action_fs.cc
File src/kudu/tools/tool_action_fs.cc:

PS4, Line 126:   // Add missing blocks to the report.
             :   report.missing_block_check.emplace();
             :   for (const auto& id : missing_block_ids) {
             :     report.missing_block_check->entries.emplace_back(
             :         id, FindOrDie(live_block_id_to_tablet, id));
             :   }
             : 
             :   // Add orphaned blocks to the report after attempting to 
repair them.
             :   report.orphaned_block_check.emplace();
             :   for (const auto& id : orphaned_block_ids) {
             :     // Opening a block isn't free, but the number of orphaned 
blocks shouldn't
             :     // be extraordinarily high.
             :     uint64_t size;
             :     {
             :       unique_ptr<ReadableBlock> block;
             :       RETURN_NOT_OK(fs_manager.OpenBlock(id, &block));
             :       RETURN_NOT_OK(block->Size(&size));
             :     }
             :     fs::OrphanedBlockCheck::Entry entry(id, size);
             : 
             :     if (FLAGS_repair) {
             :       Status s = fs_manager.DeleteBlock(id);
             :       WARN_NOT_OK(s, "Could not delete orphaned block");
             :       if (s.ok()) {
             :         entry.repaired = true;
             :       }
             :     }
             :     report.orphaned_block_check->entries.emplace_back(entry);
             :   }
             :   return report.PrintAndCheckForFatalErrors();
> should this logic be tool specific?
Not sure I understand your question; it is specific to "fs check" by virtue of 
being inside Check() and not anywhere else.


-- 
To view, visit http://gerrit.cloudera.org:8080/6581
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idcc1d512e2bed103c8d41623a63b4d071532b35e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to