[ 
https://issues.apache.org/jira/browse/PIG-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297618#comment-15297618
 ] 

liyunzhang_intel commented on PIG-4667:
---------------------------------------

[~sriksun]: The community is now reviewing Pig on Spark; below is part of the 
[feedback|https://reviews.apache.org/r/45667/#review134255] from that review 
about the following code in bin/pig:
{code}
################# ADDING SPARK DEPENDENCIES ##################
# Spark typically works with a single assembly file. However, this
# assembly isn't available as an artifact to pull in via ivy.
# To work around this shortcoming, we add all the jars barring
# spark-yarn to DIST through dist-files and then add them to the classpath
# of the executors through an independent env variable. The reason
# for excluding spark-yarn is that spark-yarn is already being added
# by the spark-yarn-client via jarOf(Client.Class)

for f in $PIG_HOME/lib/spark/*.jar; do
    if [[ $f == $PIG_HOME/lib/spark/spark-yarn* ]]; then
        # Exclude spark-yarn.jar from shipped jars, but retain in classpath
        SPARK_JARS=${SPARK_JARS}:$f;
    else
        SPARK_JARS=${SPARK_JARS}:$f;
        SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
    fi
done

for f in $PIG_HOME/lib/*.jar; do
    SPARK_JARS=${SPARK_JARS}:$f;
    SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},file://$f;
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:\${PWD}/`basename $f`
done
CLASSPATH=${CLASSPATH}:${SPARK_JARS}

#SPARK_YARN_DIST_FILES=${SPARK_YARN_DIST_FILES},$PIG_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
#SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$PIG_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
export SPARK_YARN_DIST_FILES=`echo ${SPARK_YARN_DIST_FILES} | sed 's/^,//g'`
export SPARK_JARS=${SPARK_YARN_DIST_FILES}
export SPARK_DIST_CLASSPATH
################# ADDING SPARK DEPENDENCIES ##################
{code}
Rohini left the following comment:
{quote}
This is not a good idea. If I remember correctly, spark-assembly.jar is 
128MB+. If you are copying all the individual jars that it is made up of to 
the distributed cache for every job, it will suffer bad performance, as the 
copy to HDFS and the localization by the NM will be very costly.
Like Tez, you can have users copy the assembly jar to HDFS and specify the 
HDFS location. This will ensure there is only one copy in HDFS and 
localization is done only once per node by the node manager.
{quote}
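
As a point of reference, here is a minimal sketch of the HDFS-based approach 
Rohini describes, assuming Spark 1.x, where the assembly location can be given 
through the standard spark.yarn.jar property (the HDFS path and assembly 
version below are only examples):
{code}
# One-time setup by the user/admin (analogous to Tez's tez.lib.uris):
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put spark-assembly-1.6.0-hadoop2.6.0.jar /user/spark/share/lib/

# Spark 1.x can then be pointed at the HDFS copy via spark.yarn.jar, so the
# node manager localizes it once per node instead of once per job:
#   spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
{code}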

Can we replace all the jars in $PIG_HOME/lib/spark/ with spark-assembly.jar, 
if we let end users copy spark-assembly.jar to $PIG_HOME/lib/ instead of 
downloading all the dependency jars from ivy?
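
If we go that route, the per-jar loop over $PIG_HOME/lib/spark/*.jar in 
bin/pig could shrink to something like the sketch below (untested, and it 
assumes the user has dropped a single spark-assembly*.jar into $PIG_HOME/lib/):
{code}
# Hypothetical simplification: only the assembly jar goes on Pig's local
# classpath; executors would get it from the HDFS copy referenced by
# spark.yarn.jar, so nothing is shipped through dist-files any more.
for f in $PIG_HOME/lib/spark-assembly-*.jar; do
    CLASSPATH=${CLASSPATH}:$f
done
{code}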




> Enable Pig on Spark to run on Yarn Client mode
> ----------------------------------------------
>
>                 Key: PIG-4667
>                 URL: https://issues.apache.org/jira/browse/PIG-4667
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Srikanth Sundarrajan
>            Assignee: Srikanth Sundarrajan
>             Fix For: spark-branch
>
>         Attachments: PIG-4667-logs.tgz, PIG-4667-v1.patch, PIG-4667.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
