[
https://issues.apache.org/jira/browse/SPARK-47959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841339#comment-17841339
]
PJ Fanning commented on SPARK-47959:
------------------------------------
[~zshao] if you have a test environment, could you try it with the
2.18.0-SNAPSHOT Jackson jars to see if they help?
> Improve GET_JSON_OBJECT performance on executors running multiple tasks
> -----------------------------------------------------------------------
>
> Key: SPARK-47959
> URL: https://issues.apache.org/jira/browse/SPARK-47959
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.1
> Reporter: Zheng Shao
> Priority: Major
>
> We have a Spark executor that is running 32 workers in parallel. The query
> is a simple SELECT with several `GET_JSON_OBJECT` UDF calls.
> We noticed that 80+% of the worker-thread stack traces show the thread
> blocked at the following point:
>
> {code:java}
> com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - blocked on java.lang.Object@7529fde1
> com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
> com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown Source)
> ...
> {code}
>
> Apparently jackson-core has had this performance bug from version 2.3
> through 2.15, and it is not fixed until version 2.18 (unreleased):
> [https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]
>
> {code:java}
> synchronized (lock) {
>     if (size() >= MAX_ENTRIES) {
>         clear();
>     }
> }
> {code}
>
> instead of
> [https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]
>
> {code:java}
> /* As of 2.18, the limit is not strictly enforced, but we do try to
>  * clear entries if we have reached the limit. We do not expect to
>  * go too much over the limit, and if we do, it's not a huge problem.
>  * If some other thread has the lock, we will not clear but the lock should
>  * not be held for long, so another thread should be able to clear in the near future.
>  */
> if (lock.tryLock()) {
>     try {
>         if (size() >= DEFAULT_MAX_ENTRIES) {
>             clear();
>         }
>     } finally {
>         lock.unlock();
>     }
> }
> {code}
>
> Potential fixes:
> # Upgrade to jackson-core 2.18 when it's released.
> # Follow [https://github.com/FasterXML/jackson-core/issues/998]. I don't
> fully understand the options suggested in that thread yet.
> # Introduce a new UDF that doesn't depend on jackson-core.
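
To illustrate the difference between the two quoted InternCache snippets, here is a minimal, self-contained sketch of a bounded cache that clears itself with tryLock() instead of a synchronized block. The class name BoundedInternCache, its constructor parameter, and the size() helper are hypothetical and exist only for this illustration. It is not the jackson-core implementation; it only mirrors the shape of the quoted 2.18 code, on the assumption that the cache is backed by a ConcurrentHashMap.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the quoted InternCache logic; NOT jackson-core code.
final class BoundedInternCache {
    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final int maxEntries; // assumption: a soft limit, as in the quoted 2.18 comment

    BoundedInternCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    String intern(String input) {
        String cached = map.get(input);
        if (cached != null) {
            return cached; // cache hit: no lock is touched
        }
        // Cache miss: try to clear an over-full map, but never wait for the lock.
        // With synchronized (lock) { ... } every missing thread would queue here,
        // which matches the "blocked on java.lang.Object" frame in the stack trace.
        if (lock.tryLock()) {
            try {
                if (map.size() >= maxEntries) {
                    map.clear();
                }
            } finally {
                lock.unlock();
            }
        }
        String interned = input.intern();
        String previous = map.putIfAbsent(interned, interned);
        return previous != null ? previous : interned;
    }

    int size() {
        return map.size();
    }
}
```

The trade-off in this sketch is that the limit becomes approximate: a thread that loses the tryLock() race simply skips the clear and inserts anyway, so the map can briefly exceed maxEntries, but no thread ever blocks on a cache miss.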