Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16949 )
Change subject: IMPALA-9224: Blacklist nodes with faulty disk for spilling ...................................................................... Patch Set 12: (8 comments) http://gerrit.cloudera.org:8080/#/c/16949/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16949/9//COMMIT_MSG@7 PS9, Line 7: Blackli This patch seems pretty self-contained, so I'm not sure there needs to be a "part 2", vs. filing separate JIRAs for the other tasks we have planned, but if there is something that you think should be included as a "part 2" could you briefly describe what that is in the message below. http://gerrit.cloudera.org:8080/#/c/16949/9//COMMIT_MSG@13 PS9, Line 13: the executor hitted blacklistable error during spill-to-disk. It would be good to add a brief description of the flag that you're adding here. http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/io/error-converter.cc File be/src/runtime/io/error-converter.cc: http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/io/error-converter.cc@62 PS12, Line 62: static const set<int32_t> non_blacklistable_disk_error_codes = { I think its safer to enumerate out the errnos that we want to blacklist for, rather than the ones we don't want to blacklist for, so that the default behavior is more conservative - eg. if a new error code got added by the underlying system such that we don't know about it, the default behavior should be to not blacklist for that code. http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/query-state.cc@71 PS12, Line 71: DEFINE_bool(reporting_disk_health_enabled, false, I know I was the one who told you to add this, but I realized we already have a flag to turn this off if problems occur (--blacklisting_enabled) and we probably don't need to have separate flags for individually turning on and off each blacklisting scenario. Plus, with the new error code stuff I feel more comfortable that this patch isn't going to have unintended consequences. If you would still prefer to keep it for now, I would recommend marking it as "experimental" rather than "advanced" as that gives us more room to remove it in the future. http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc@679 PS12, Line 679: if (ErrorConverter::IsBlacklistableError(status)) { So we're not considering 'num_blacklisted_files_' here. I guess the point is that for writes a single failure doesn't necessarily fail the query, since we can try to write to the other files, but for reads a single failure does fail the overall query, cause we only wrote this once, so there's no where else to try reading it from? I'm concerned that may make us a bit too aggressive about blacklisting, so maybe we should only do the blacklisting on the write path? After all, if a read fails because the disk really is completely faulty, we'll figure that out on the next query that spills when it can't do any writing. http://gerrit.cloudera.org:8080/#/c/16949/12/be/src/runtime/tmp-file-mgr.cc@797 PS12, Line 797: if (ErrorConverter::IsBlacklistableError(err)) { I think the logic here is that if a single one of the errors that we got was blacklistable then we blacklist it, but do we maybe want it to be that we require that all errors are blacklistable before blacklisting. If any one of the errors was recoverable (i.e. not blacklistable) then seems like our expectation is that future queries really could succeed. http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py File tests/custom_cluster/test_blacklist.py: http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py@326 PS12, Line 326: assert results.success I think you should be able to check that the runtime profile contains the "Blacklisted Executors" line with the corresponding backend. http://gerrit.cloudera.org:8080/#/c/16949/12/tests/custom_cluster/test_blacklist.py@328 PS12, Line 328: client.close_query(handle) Maybe add a sleep that's long enough to see the BE get removed from the blacklist and then run another query that doesn't have the debug action and show that the profile shows it running on all 3 backends. -- To view, visit http://gerrit.cloudera.org:8080/16949 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437 Gerrit-Change-Number: 16949 Gerrit-PatchSet: 12 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Fri, 29 Jan 2021 19:11:07 +0000 Gerrit-HasComments: Yes
