[
https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kent Yao updated SPARK-25174:
-----------------------------
Description:
We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is
not about the issue in SPARK-18016 itself but about a side effect it causes. When
SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the
exception carries an extremely large error message.
{code:java}
ERROR yarn.ApplicationMaster: User class threw exception:
java.lang.RuntimeException: Error while decoding:
java.util.concurrent.ExecutionException: java.lang.Exception: failed to
compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown
past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
....
/* 395656 */ mutableRow.update(0, value);
/* 395657 */ }
/* 395658 */
/* 395659 */ return mutableRow;
/* 395660 */ }
/* 395661 */ }
{code}
The codegen text above is included in the final message the AM uses to wave
goodbye to the RM. It ends up crashing the RM's ZKRMStateStore, because
YARN-6125 does not cover truncation of the unregisterApplicationMaster message.
We have also filed a JIRA on the YARN side:
https://issues.apache.org/jira/browse/YARN-8691
Although SPARK-18016 is already fixed, other uncaught exceptions may still
cause this problem. I think we should limit the size of the error message sent
to the RM while unregistering.
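To illustrate the proposal, here is a minimal sketch of capping a diagnostic message before it is handed to unregisterApplicationMaster. This is not Spark's or YARN's actual code: the class name DiagnosticsLimiter, the method limit, the truncation marker, and the idea of measuring the cap in UTF-8 bytes are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Spark or YARN. */
final class DiagnosticsLimiter {
    // Marker appended when the message is cut. ASCII only, so chars == bytes.
    private static final String MARKER = "... [truncated]";

    private DiagnosticsLimiter() {}

    /**
     * Returns the message unchanged if its UTF-8 encoding fits in maxBytes.
     * Otherwise keeps the head of the message and appends MARKER, so that
     * the result never exceeds maxBytes when encoded as UTF-8.
     */
    static String limit(String message, int maxBytes) {
        if (maxBytes < MARKER.length()) {
            throw new IllegalArgumentException("maxBytes is smaller than the marker");
        }
        if (message == null) {
            return "";
        }
        if (message.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
            return message;
        }
        int budget = maxBytes - MARKER.length();
        int used = 0;
        int end = 0;
        // Walk code points so a multi-byte character is never split.
        while (end < message.length()) {
            int cp = message.codePointAt(end);
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (used + size > budget) {
                break;
            }
            used += size;
            end += Character.charCount(cp);
        }
        return message.substring(0, end) + MARKER;
    }
}
```

A caller would apply something like DiagnosticsLimiter.limit(msg, 64 * 1024) to the exception text before unregistering. The 64 KiB figure is only a placeholder; a real limit would have to be chosen against what the RM state store can accept. Keeping the head of the message preserves the exception type and the first lines of the stack trace, which are usually the most useful part.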
> ApplicationMaster suspends when unregistering itself from RM with extreme
> large diagnostic message
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-25174
> URL: https://issues.apache.org/jira/browse/SPARK-25174
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.1.1
> Reporter: Kent Yao
> Priority: Major