[
https://issues.apache.org/jira/browse/LIVY-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634150#comment-17634150
]
Jeff Xu commented on LIVY-896:
------------------------------
[~lmccay] This fix should be included in a release as early as possible because
when the issue is hit, the impact is severe. On some AWS EMR clusters, our
production code hits this issue very frequently (~50% chance on one particular
EMR).
> Livy can intermittently return a batch as SUCCEED even though Spark on YARN
> actually fails
> ------------------------------------------------------------------------------------
>
> Key: LIVY-896
> URL: https://issues.apache.org/jira/browse/LIVY-896
> Project: Livy
> Issue Type: Bug
> Components: Server
> Reporter: Jeff Xu
> Priority: Critical
> Fix For: 0.9.0
>
> Attachments: livy_batch.py, livy_client.py
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> *Summary:*
> * I ran into this issue using AWS EMR.
> * Frequency of the issue varies. On one EMR cluster, I typically see a
> ~10-20% chance of hitting the issue, but on another EMR cluster the chance is
> ~1%. I suspect the chance depends on how busy the AWS hardware actually is
> (my EMR clusters likely share hardware resources with other AWS tenants).
> * I believe I have identified the root cause in the Livy source code (refer
> to a later section).
>
> *How to reproduce:*
> * An EMR with Spark, Yarn and Livy configured.
> * Use the attached livy_batch.py to trigger a Livy batch via the Livy Python
> client (0.8.0). See the attached livy_client.py.
> * Repeat the test; when the issue happens, Livy still reports the batch as
> SUCCEED even though the Spark program errors out.
>
> *Livy log for a good case when Livy returns batch as DEAD (expected
> behavior):*
> {noformat}
> 22/10/14 02:46:22 INFO BatchSessionManager: Registered new session 1
> 22/10/14 02:46:42 DEBUG BatchSession: BatchSession 1 state changed from
> STARTING to RUNNING
> 22/10/14 02:46:43 WARN BatchSession$: spark-submit exited with code 1
> 22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from
> RUNNING to FINISHED
> 22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from
> FINISHED to FAILED{noformat}
>
> *Livy log for bad case when Livy returns batch as SUCCEED (bug):*
> {noformat}
> 22/10/14 02:47:40 INFO BatchSessionManager: Registered new session 3
> 22/10/14 02:48:00 DEBUG BatchSession: BatchSession 3 state changed from
> STARTING to FINISHED
> 22/10/14 02:48:01 WARN BatchSession$: spark-submit exited with code 1{noformat}
>
> *Root cause analysis:*
> * I think the bug is in the YarnAppMonitorThread of Livy server.
> * The bug is in this section:
> [https://github.com/apache/incubator-livy/blob/v0.7.1-incubating-rc1/server/src/main/scala/org/apache/livy/utils/SparkYarnApp.scala#L283-L299]
> * When the timing is right:
> ** YARN sees the application as completed,
> ** but the spark-submit process is still running.
> * So line 286 returns an app report saying the application completed
> successfully (YARN has no knowledge of whether the Spark application itself
> succeeded or not).
> * Line 288 sets the Livy session state to success.
> * Line 293 returns false because spark-submit is still running.
> * So we hit the bug.
> Even without hitting the timing condition, the code logic itself is {*}still
> incorrect{*}.
> If you look at the log from a "good" case, the session state was updated
> twice: FINISHED, then FAILED. If a client query arrives with exactly the
> wrong timing, the Livy server can still return a wrong state.
> {noformat}
> 22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from
> RUNNING to FINISHED
> 22/10/14 02:46:47 DEBUG BatchSession: BatchSession 1 state changed from
> FINISHED to FAILED{noformat}
>
> I hope we can work together to have the issue addressed ASAP, as the bug hits
> our production code pretty badly. I think the right code logic should be:
> # Read the spark-submit process's state; if it is still running, do nothing.
> # If the spark-submit process has finished, read the YARN report and
> determine the actual application finish state in a single shot.
> # Update the session state in a single step.
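The three steps above can be sketched as a small, self-contained model of the intended check order. This is only an illustration, not actual Livy code: `ProcessState`, `YarnReport`, and the state names are hypothetical stand-ins for Livy's internal classes.

```scala
// Hypothetical model of the suggested fix ordering; not actual Livy classes.
sealed trait SessionState
case object Running extends SessionState
case object Success extends SessionState
case object Dead extends SessionState

// What we can observe about the spark-submit child process.
case class ProcessState(isAlive: Boolean, exitCode: Int)

// YARN's view of the application; only meaningful once spark-submit exits.
case class YarnReport(finalStatus: String) // e.g. "SUCCEEDED" or "FAILED"

def nextState(proc: ProcessState, yarn: => YarnReport): SessionState = {
  // Step 1: if spark-submit is still running, do nothing -- do NOT
  // trust YARN's report yet (this is the race the bug hits).
  if (proc.isAlive) Running
  // Step 2: spark-submit has finished; combine its exit code with the
  // YARN report to determine the final state in a single shot.
  else if (proc.exitCode == 0 && yarn.finalStatus == "SUCCEEDED") Success
  // Step 3: anything else is a failure -- one atomic transition,
  // never FINISHED followed by FAILED.
  else Dead
}
```

With this ordering, the session never passes through a transient "success" state while spark-submit is still able to fail.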
>
> At the same time, I will see if I can create a PR with the suggested fix
> soon. The challenge on my side is that it's almost impossible for me to swap
> in jars built from the open-source code base on AWS EMR (they are not
> compatible with the EMR runtime).
>
> Thank you, Livy team!
> Regards,
> Jeff Xu, a Workday engineer
--
This message was sent by Atlassian Jira
(v8.20.10#820010)