[Impala-ASF-CR] IMPALA-9224: Blacklist nodes with faulty disk for spilling

Thomas Tauber-Marshall (Code Review) Fri, 29 Jan 2021 11:11:27 -0800

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16949 )


Change subject: IMPALA-9224: Blacklist nodes with faulty disk for spilling
......................................................................


Patch Set 12:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16949/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16949/9//COMMIT_MSG@7
PS9, Line 7:  Blackli
This patch seems pretty self-contained, so I'm not sure there needs to be a 
"part 2", vs. filing separate JIRAs for the other tasks we have planned, but if 
there is something that you think should be included as a "part 2" could you 
briefly describe what that is in the message below.


http://gerrit.cloudera.org:8080/#/c/16949/9//COMMIT_MSG@13
PS9, Line 13: the executor hitted blacklistable error during spill-to-disk.
It would be good to add a brief description of the flag that you're adding here.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/io/error-converter.cc
File be/src/runtime/io/error-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/io/error-converter.cc@62
PS12, Line 62:   static const set<int32_t> non_blacklistable_disk_error_codes = 
{
I think its safer to enumerate out the errnos that we want to blacklist for, 
rather than the ones we don't want to blacklist for, so that the default 
behavior is more conservative - eg. if a new error code got added by the 
underlying system such that we don't know about it, the default behavior should 
be to not blacklist for that code.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/query-state.cc@71
PS12, Line 71: DEFINE_bool(reporting_disk_health_enabled, false,
I know I was the one who told you to add this, but I realized we already have a 
flag to turn this off if problems occur (--blacklisting_enabled) and we 
probably don't need to have separate flags for individually turning on and off 
each blacklisting scenario.

Plus, with the new error code stuff I feel more comfortable that this patch 
isn't going to have unintended consequences.

If you would still prefer to keep it for now, I would recommend marking it as 
"experimental" rather than "advanced" as that gives us more room to remove it 
in the future.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc@679
PS12, Line 679:     if (ErrorConverter::IsBlacklistableError(status)) {
So we're not considering 'num_blacklisted_files_' here. I guess the point is 
that for writes a single failure doesn't necessarily fail the query, since we 
can try to write to the other files, but for reads a single failure does fail 
the overall query, cause we only wrote this once, so there's no where else to 
try reading it from?

I'm concerned that may make us a bit too aggressive about blacklisting, so 
maybe we should only do the blacklisting on the write path? After all, if a 
read fails because the disk really is completely faulty, we'll figure that out 
on the next query that spills when it can't do any writing.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc@797
PS12, Line 797:           if (ErrorConverter::IsBlacklistableError(err)) {
I think the logic here is that if a single one of the errors that we got was 
blacklistable then we blacklist it, but do we maybe want it to be that we 
require that all errors are blacklistable before blacklisting.

If any one of the errors was recoverable (i.e. not blacklistable) then seems 
like our expectation is that future queries really could succeed.


http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py@326
PS12, Line 326:     assert results.success
I think you should be able to check that the runtime profile contains the 
"Blacklisted Executors" line with the corresponding backend.


http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py@328
PS12, Line 328:     client.close_query(handle)
Maybe add a sleep that's long enough to see the BE get removed from the 
blacklist and then run another query that doesn't have the debug action and 
show that the profile shows it running on all 3 backends.



--
To view, visit http://gerrit.cloudera.org:8080/16949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437
Gerrit-Change-Number: 16949
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Fri, 29 Jan 2021 19:11:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9224: Blacklist nodes with faulty disk for spilling

Reply via email to