Github user chemikadze commented on a diff in the pull request:

    https://github.com/apache/incubator-griffin/pull/421#discussion_r219724832

    --- Diff: service/src/main/java/org/apache/griffin/core/util/YarnNetUtil.java ---
    @@ -56,6 +62,14 @@ public static boolean update(String url, JobInstanceBean instance) {
                     instance.setState(LivySessionStates.toLivyState(state));
                 }
                 return true;
    +        } catch (HttpClientErrorException e) {
    +            LOGGER.warn("client error {} from yarn: {}",
    +                    e.getMessage(), e.getResponseBodyAsString());
    +            if (e.getStatusCode() == HttpStatus.NOT_FOUND) {
    +                // in sync with Livy behavior, see com.cloudera.livy.utils.SparkYarnApp
    +                instance.setState(DEAD);
    --- End diff --

    Only 404 is handled here, which should not be the result of a network issue. It looks like any kind of error reported by the YARN client (after internal retries) results in the job being marked DEAD on the Livy side:

    https://github.com/cloudera/livy/blob/master/server/src/main/scala/com/cloudera/livy/utils/SparkYarnApp.scala#L307

    I'll need to double-check whether not-found applications ever get retried, to make sure the behavior is the same as on the Livy side. If not -- then that's what Livy would do.
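    A minimal standalone sketch of the distinction the comment raises. The names here (`JobState`, `stateForClientError`) are hypothetical, not Griffin or Livy APIs; the real code uses Spring's `HttpClientErrorException` and Griffin's `LivySessionStates`. It models the suggested behavior: any 4xx client error from YARN (after retries), not just 404, would mark the job DEAD, while server-side/network errors would leave the state unchanged.

    ```java
    // Hypothetical sketch only -- models the error-classification logic
    // discussed above, not the actual Griffin/Livy implementation.
    public class YarnErrorHandlingSketch {

        enum JobState { UNCHANGED, DEAD }

        // Decide the job state for a failed YARN REST call with the
        // given HTTP status. Mirrors the claim that Livy's SparkYarnApp
        // treats any client error (after internal retries) as DEAD,
        // whereas the diff under review only handles 404.
        static JobState stateForHttpStatus(int httpStatus) {
            if (httpStatus >= 400 && httpStatus < 500) {
                // Client error: YARN does not know the application
                // (404) or rejects the request -- retrying won't help.
                return JobState.DEAD;
            }
            // 5xx or transport-level failures may be transient,
            // so keep the current state and let polling retry.
            return JobState.UNCHANGED;
        }

        public static void main(String[] args) {
            System.out.println(stateForHttpStatus(404));
            System.out.println(stateForHttpStatus(400));
            System.out.println(stateForHttpStatus(503));
        }
    }
    ```

    Under this sketch, 404 and 400 both map to DEAD, while 503 leaves the state untouched, which is the broadening of the 404-only check that the comment is questioning.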
---