[jira] [Commented] (SPARK-47959) Improve GET_JSON_OBJECT performance on executors running multiple tasks

Tatu Saloranta (Jira) Fri, 26 Apr 2024 13:27:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-47959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841355#comment-17841355
 ]


Tatu Saloranta commented on SPARK-47959:
----------------------------------------

Aside from the question of reducing contention in InternCache, my experience has 
been that when this blocking is hit, there is always some other problem involved: 
either an unbounded number of distinct keys (such as UUID keys) or a lack of 
`JsonFactory` reuse. In the latter case the best solution is to reuse the 
`JsonFactory` (whether directly or by reusing the ObjectMapper that owns it); in 
the former case (or, as a second alternative for the latter case), there are two 
`JsonFactory.Feature` settings that may be disabled:
 * JsonFactory.Feature.INTERN_FIELD_NAMES: if names are not reused across 
reads, there is little value in String.intern()
 * JsonFactory.Feature.CANONICALIZE_FIELD_NAMES: if there is neither reuse nor 
repeating symbols, canonicalization can be disabled entirely.

So it may be worth experimenting with these settings, disabling one or the 
other: if CANONICALIZE_FIELD_NAMES is disabled, INTERN_FIELD_NAMES has no effect.
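
A minimal sketch of the two suggestions above (one shared `JsonFactory`, with interning turned off), assuming jackson-core 2.10 or later for the builder API. The class and method names (`SharedJsonFactoryExample`, `countFields`) are hypothetical and only for illustration; this is not how Spark configures its factory.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SharedJsonFactoryExample {
    // One factory shared by all threads: JsonFactory is thread-safe once
    // configured, and reusing it lets parsers share the symbol table.
    // Interning is disabled here; to skip canonicalization entirely,
    // disable JsonFactory.Feature.CANONICALIZE_FIELD_NAMES instead.
    static final JsonFactory FACTORY = JsonFactory.builder()
            .disable(JsonFactory.Feature.INTERN_FIELD_NAMES)
            .build();

    // Hypothetical helper: counts field names in a UTF-8 encoded document,
    // creating a short-lived parser from the shared factory on each call.
    static int countFields(String json) throws IOException {
        int count = 0;
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
        try (JsonParser parser = FACTORY.createParser(bytes)) {
            while (parser.nextToken() != null) {
                if (parser.currentToken() == JsonToken.FIELD_NAME) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countFields("{\"a\":1,\"b\":{\"c\":2}}"));
    }
}
```

Parsers stay per-call and single-threaded; only the factory is shared. Whether either setting helps depends on how often field names repeat in the data, so it is worth measuring both variants.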

Put another way: while there is some value in improving locking of 
`InternCache`, it is unlikely to be the most effective solution to whatever 
problem there is.

> Improve GET_JSON_OBJECT performance on executors running multiple tasks
> -----------------------------------------------------------------------
>
>                 Key: SPARK-47959
>                 URL: https://issues.apache.org/jira/browse/SPARK-47959
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.1
>            Reporter: Zheng Shao
>            Priority: Major
>
> We have a Spark executor that is running 32 workers in parallel.  The query 
> is a simple SELECT with several `GET_JSON_OBJECT` UDF calls.
> We noticed that more than 80% of the worker threads' stack traces are 
> blocked at the following location:
>  
> {code:java}
> com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - blocked on java.lang.Object@7529fde1
> com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
> com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
> org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown Source)
> ...
> {code}
>  
> Apparently jackson-core has had this performance bug from version 2.3 
> through 2.15, and it is not fixed until version 2.18 (unreleased): 
> [https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]
>  
> {code:java}
>             synchronized (lock) {
>                 if (size() >= MAX_ENTRIES) {
>                     clear();
>                 }
>             }
> {code}
>  
> instead of 
> [https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]
>  
> {code:java}
>             /* As of 2.18, the limit is not strictly enforced, but we do try to
>              * clear entries if we have reached the limit. We do not expect to
>              * go too much over the limit, and if we do, it's not a huge problem.
>              * If some other thread has the lock, we will not clear but the lock should
>              * not be held for long, so another thread should be able to clear in the near future.
>              */
>             if (lock.tryLock()) {
>                 try {
>                     if (size() >= DEFAULT_MAX_ENTRIES) {
>                         clear();
>                     }
>                 } finally {
>                     lock.unlock();
>                 }
>             }
> {code}
>  
> Potential fixes:
>  # Upgrade to Jackson-core 2.18 when it's released;
>  # Follow [https://github.com/FasterXML/jackson-core/issues/998] - I don't 
> totally understand the options suggested by this thread yet.
>  # Introduce a new UDF that doesn't depend on jackson-core



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
