[
https://issues.apache.org/jira/browse/PIG-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320250#comment-15320250
]
liyunzhang_intel commented on PIG-4893:
---------------------------------------
[~sriksun] and [~pallavi.rao]:
In the current spark branch, task deserialization takes a long time because we
append all jars under $PIG_HOME/lib/ and $PIG_HOME/lib/Spark to
SPARK_YARN_DIST_FILES; the spark client uploads all of these jars to hdfs, and
the yarn containers then spend time downloading them when deserializing the
task. In PIG-4903_2.patch, we stop appending all jars to SPARK_YARN_DIST_FILES
in bin/pig. In PIG-4893.patch, we instead dynamically distribute only the
necessary jars by using SparkContext.addJar(), which uploads each jar to hdfs
so that the yarn containers can access them later.
Because both of you are familiar with this part, please help review; I will
fold this patch into the final patch of the spark branch when merging with trunk.
SparkContext.addJar
{code}
/**
 * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
 * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
 * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
 */
def addJar(path: String) {
  ...
}
{code}
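To make the intended change concrete for reviewers, here is a minimal sketch of
how jars could be registered through this API from the Pig side. It is only an
illustration under assumptions (the JavaSparkContext setup and the neededJars
list are hypothetical), not the exact code in PIG-4893.patch.
{code}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AddJarSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PigOnSpark");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical list of jars this particular job actually needs.
        // With SPARK_YARN_DIST_FILES we used to ship everything under
        // $PIG_HOME/lib and $PIG_HOME/lib/Spark regardless of need.
        List<String> neededJars = Arrays.asList(
                "/path/to/pig-core.jar",
                "/path/to/udf-dependency.jar");

        // addJar distributes each jar so that yarn containers can fetch it
        // when they deserialize tasks, instead of downloading the whole lib
        // directory up front.
        for (String jar : neededJars) {
            sc.addJar(jar);
        }

        sc.stop();
    }
}
{code}
The point of the sketch is only the call pattern: a per-job set of jars passed
to addJar() rather than a blanket SPARK_YARN_DIST_FILES listing built in bin/pig.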
PIG-4893.patch is based on commit eab9180 in the spark branch.
> Task deserialization time is too long for spark on yarn mode
> ------------------------------------------------------------
>
> Key: PIG-4893
> URL: https://issues.apache.org/jira/browse/PIG-4893
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4893.patch, time.PNG
>
>
> I found that task deserialization time is quite long when I run any of the
> pigmix scripts in spark on yarn mode; see the attached picture. The task
> duration is 3s, but task deserialization takes 13s.
> My env is hadoop2.6+spark1.6.