[ https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309940#comment-15309940 ]

liyunzhang_intel commented on PIG-4893:
---------------------------------------

Here is a summary of why the task deserialization time is so long:
we add all dependency jars under $PIG_HOME/lib/ and $PIG_HOME/lib/spark/ to 
$SPARK_JARS, so Spark ships all of these jars to the Hadoop distributed cache. The YARN 
container then downloads all of these jars when deserializing a 
job ([org.apache.spark.executor.Executor#updateDependencies|https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/executor/Executor.scala#L424]).

After removing some of the big dependencies in $PIG_HOME/lib/ (such as 
jython-standalone-2.5.3.jar, jruby-complete-1.6.7.jar, and so on, which are not 
needed when running a simple Pig script), the deserialization time is 
reduced from 12s to 4s. So do we need to ship all the jars under $PIG_HOME/lib/* 
every time, even though some of the jars are not actually needed?
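A minimal shell sketch of the kind of filtering this suggests, skipping heavyweight jars when building the list that feeds $SPARK_JARS. This is illustrative only, not Pig's actual launcher code; the jar names, the skip list, and the use of a temp directory standing in for $PIG_HOME/lib are all assumptions for the demo:

```shell
# Demo directory standing in for $PIG_HOME/lib (illustrative jar names).
PIG_LIB="$(mktemp -d)"
touch "$PIG_LIB/jython-standalone-2.5.3.jar" \
      "$PIG_LIB/jruby-complete-1.6.7.jar" \
      "$PIG_LIB/pig-core.jar"

# Build the comma-separated jar list, skipping big optional dependencies
# so they are never shipped to the distributed cache.
SPARK_JARS=""
for jar in "$PIG_LIB"/*.jar; do
  case "$(basename "$jar")" in
    jython-standalone-*|jruby-complete-*) continue ;;
  esac
  SPARK_JARS="${SPARK_JARS:+$SPARK_JARS,}$jar"
done
echo "$SPARK_JARS"
```

With a filter like this, only the jars a script actually needs end up in the distributed cache, which is what cut the deserialization time above.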



> Task deserialization time is too long for spark on yarn mode
> ------------------------------------------------------------
>
>                 Key: PIG-4893
>                 URL: https://issues.apache.org/jira/browse/PIG-4893
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: time.PNG
>
>
> I found that the task deserialization time is a bit long when I run any of the 
> pigmix scripts in spark on yarn mode.  See the attached picture.  The task duration 
> is 3s but the task deserialization is 13s.  
> My env is hadoop2.6+spark1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
