[
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309940#comment-15309940
]
liyunzhang_intel commented on PIG-4893:
---------------------------------------
Here is a summary of why task deserialization time is too long:
we add all dependency jars under $PIG_HOME/lib/ and $PIG_HOME/lib/spark/ to
$SPARK_JARS, so Spark ships all of these jars to the Hadoop distributed cache.
The Yarn container then downloads all of them when deserializing a
job ([org.apache.spark.executor.Executor#updateDependencies|https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/executor/Executor.scala#L424]).
After removing some big dependencies from $PIG_HOME/lib/ (such as
jython-standalone-2.5.3.jar, jruby-complete-1.6.7.jar, and so on; we don't need
these jars when running a simple Pig script), the deserialization time is
reduced from 12s to 4s. So do we need to ship all the jars under $PIG_HOME/lib/*
every time, even though some of them are actually not needed?
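A minimal sketch of the idea, assuming a helper function and jar names taken from the examples above (the skip list and $PIG_HOME layout are illustrative, not the actual Pig launcher code): collect jars from $PIG_HOME/lib/ and $PIG_HOME/lib/spark/ into a comma-separated list, skipping the big jars a simple script does not need.

```shell
# Hypothetical helper: build the jar list for SPARK_JARS from a Pig install,
# excluding large jars (jython, jruby) that a simple script never loads.
build_spark_jars() {
  pig_home=$1
  jars=""
  for jar in "$pig_home"/lib/*.jar "$pig_home"/lib/spark/*.jar; do
    [ -e "$jar" ] || continue                      # glob matched nothing
    case "$(basename "$jar")" in
      jython-standalone-*|jruby-complete-*) continue ;;  # skip big jars
    esac
    jars="${jars:+$jars,}$jar"                     # comma-separated list
  done
  printf '%s\n' "$jars"
}

# Example usage (PIG_HOME assumed to point at the Pig install):
# export SPARK_JARS=$(build_spark_jars "$PIG_HOME")
```

This only trims the list on the client side; a real fix would likely need Pig itself to decide which jars a given script requires before populating SPARK_JARS.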
> Task deserialization time is too long for spark on yarn mode
> ------------------------------------------------------------
>
> Key: PIG-4893
> URL: https://issues.apache.org/jira/browse/PIG-4893
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: time.PNG
>
>
> I found the task deserialization time is a bit long when I run any script of
> PigMix in Spark on Yarn mode; see the attached picture. The duration time
> is 3s but the task deserialization is 13s.
> My env is Hadoop 2.6 + Spark 1.6.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)