Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16949 )

Change subject: IMPALA-9224: Blacklist nodes with faulty disk for spilling
......................................................................


Patch Set 12:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/io/error-converter.cc
File be/src/runtime/io/error-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/io/error-converter.cc@62
PS12, Line 62:   static const set<int32_t> non_blacklistable_disk_error_codes = 
{
> I think its safer to enumerate out the errnos that we want to blacklist for
Right, we should be more conservative. Will fix it.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/query-state.cc@71
PS12, Line 71: DEFINE_bool(reporting_disk_health_enabled, false,
> I know I was the one who told you to add this, but I realized we already ha
Will remove it.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc@679
PS12, Line 679:     if (ErrorConverter::IsBlacklistableError(status)) {
> So we're not considering 'num_blacklisted_files_' here. I guess the point i
Will remove blacklisting for single reading error from this patch. I was 
thinking to change it in more conservative way in following patch, like adding 
stats in TmpFileMgr class to count disk io failures in last 10 minutes and 
blacklisting node if the count exceed a certain threshold.


http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc@797
PS12, Line 797:           if (ErrorConverter::IsBlacklistableError(err)) {
> I think the logic here is that if a single one of the errors that we got wa
Agree, will fix it.


http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py@326
PS12, Line 326:     assert results.success
> I think you should be able to check that the runtime profile contains the "
Ok, will add code to check runtime profile.


http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py@328
PS12, Line 328:     client.close_query(handle)
> Maybe add a sleep that's long enough to see the BE get removed from the bla
Will add another query after blacklisting timeout.



--
To view, visit http://gerrit.cloudera.org:8080/16949
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437
Gerrit-Change-Number: 16949
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Fri, 29 Jan 2021 20:32:45 +0000
Gerrit-HasComments: Yes

Reply via email to