GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/12203

    [SPARK-14423][YARN] Avoid same name files added to distributed cache again

    ## What changes were proposed in this pull request?
    
    In the current implementation of assembly-free Spark deployment, the jars 
under `assembly/target/scala-xxx/jars` are uploaded to the distributed cache 
by default. Their names can conflict with the names of jars specified via 
`--jars`, which raises an exception when the application starts:
    
    ```
    client token: N/A
         diagnostics: Application application_1459907402325_0004 failed 2 times 
due to AM Container for appattempt_1459907402325_0004_000002 exited with  
exitCode: -1000
    For more detailed output, check application tracking 
page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click 
on links to logs of each attempt.
    Diagnostics: Resource 
hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
 changed on src filesystem (expected 1459909780508, was 1459909782590
    java.io.IOException: Resource 
hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
 changed on src filesystem (expected 1459909780508, was 1459909782590
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    ```
    
    So this patch checks the file name before staging each file, to avoid 
uploading a file with the same name twice.
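
    The idea can be sketched as a de-duplication pass keyed on the file's base 
name. This is an illustrative Python sketch only, not Spark's actual 
implementation (the real patch lives in Spark's YARN `Client`, in Scala, and 
`files_to_upload` is a hypothetical helper name):

```python
import os


def files_to_upload(paths):
    """Return paths whose base names have not been seen yet.

    Later entries whose base name collides with an earlier one (e.g. a
    --jars entry colliding with a jar already bundled under
    assembly/target/scala-xxx/jars) are skipped, so the same name is
    never staged to the distributed cache twice.
    """
    seen = set()
    result = []
    for path in paths:
        name = os.path.basename(path)
        if name in seen:
            # Same-name file already staged; skip to avoid the
            # "Resource ... changed on src filesystem" failure.
            continue
        seen.add(name)
        result.append(path)
    return result
```

    For example, if both the bundled jars and `--jars` contribute an 
`avro-mapred-1.7.7-hadoop2.jar`, only the first occurrence would be staged.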
    
    ## How was this patch tested?
    
    Unit tests and a manual integration test were done locally.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-14423

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12203.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12203
    
----
commit 7ff58be3ab8ae2d1768a53245bbe0be01014a211
Author: jerryshao <[email protected]>
Date:   2016-04-06T09:48:52Z

    avoid same name files added to distributed cache again

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
