Sahil Takiar has posted comments on this change. (http://gerrit.cloudera.org:8080/14677)
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
......................................................................

Patch Set 12:

> Depends on the particular RPC. When we blacklist for a failed
> Exec() rpc, it will never have been retried. For TransmitData(), it
> will only be retried if the error that came back is "server too
> busy"

How about connection / connectivity errors? If the connection is broken, do we make any attempt to retry connection establishment, or do we just fail?

> If auto-scaling takes into account blacklisting, I would probably
> say that's the wrong approach and we should think about changing it
> to only consider the statestore view of cluster membership.

Well, if a node is blacklisted, no queries get scheduled on that node, right? So functionally, the node isn't doing any useful work, and it might make sense for the auto-scaler to treat it as "dead" and no longer part of the executor group. Perhaps a better approach would be to only consider a node "dead" once it has been blacklisted multiple times. This is a larger discussion, though; we haven't spent enough time thinking through how node blacklisting should integrate with the auto-scaler. I think we should, but probably as a separate item.

> Definitely agree that we need to put some work into realistic fault
> testing, and not just rely on our intuition about blacklisting/retry
> policy.

Filed IMPALA-9253 for this.

> I'm definitely interested in approaches to retries that don't
> require marking statuses retriable (obviously I've had a lot of
> concerns with that approach). I could see some problems with this
> approach too, such as that we may have situations in the future
> where we want to retry queries without doing any blacklisting (eg.
> if we think there was stale metadata and replanning the query will
> let it succeed)

Yeah, it's possible that in the future we might want to retry queries even when no node has been blacklisted, but I don't see a strong use case for that pattern right now. In the interest of not over-engineering the initial solution, I'm fine with making retries purely blacklist / cluster-membership driven.

> or probably worse - that you could have a query
> that fails due to a failed rpc and reports that the node should be
> blacklisted, and then also hits another error that isn't retry-able
> but we retry it anyways. Not sure how common a situation like that
> would be, though.

Definitely possible, but a bit of an edge case. A query would have to fail for two separate reasons at approximately the same time: an error that triggers a blacklist and an error that doesn't. Whenever the Coordinator gets an error, it cancels the rest of the query, so the non-blacklist error would have to occur before the cancellation is processed. It's something we should fix, but I think we can defer it out of the initial implementation. Filed IMPALA-9124 as a follow-up. The fix would probably be to ensure that every failed fragment either triggered a blacklist or failed because it was cancelled.

--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Mon, 16 Dec 2019 21:30:07 +0000
Gerrit-HasComments: No
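The blacklist-driven retry policy discussed in the thread above, together with the invariant the IMPALA-9124 follow-up aims for (every failed fragment either triggered a blacklist or was cancelled), could be sketched roughly as follows. This is a minimal, hypothetical Python sketch with illustrative failure labels, not Impala's actual (C++) implementation:

```python
# Hypothetical sketch of a blacklist-driven retry decision: retries are
# keyed off cluster membership changes (a node got blacklisted) rather
# than per-status "retryable" flags. Labels are illustrative only.

def should_retry_query(fragment_failures):
    """fragment_failures: one label per failed fragment instance:
    "blacklisted" - the failure caused a node to be blacklisted,
    "cancelled"   - the fragment failed because the Coordinator
                    cancelled the rest of the query,
    "other"       - an unrelated, non-blacklist error.
    """
    # At least one failure must have blacklisted a node...
    caused_blacklist = any(f == "blacklisted" for f in fragment_failures)
    # ...and every failed fragment should either have triggered a
    # blacklist or have failed due to cancellation; otherwise we could
    # end up retrying an error that isn't retry-able (the IMPALA-9124
    # edge case described above).
    all_accounted_for = all(
        f in ("blacklisted", "cancelled") for f in fragment_failures)
    return caused_blacklist and all_accounted_for
```

Under this sketch, a query whose fragments failed with `["blacklisted", "cancelled"]` would be retried, while `["blacklisted", "other"]` would not, because the "other" failure is unaccounted for.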
