GitHub user jerryshao opened a pull request:
https://github.com/apache/spark/pull/12203
[SPARK-14423][YARN] Avoid same name files added to distributed cache again
## What changes were proposed in this pull request?
In the current assembly-free Spark deployment, the jars under
`assembly/target/scala-xxx/jars` are uploaded to the distributed cache by
default. One of these jars may have the same file name as a jar specified via
`--jars`, which causes an exception when starting the application:
```
client token: N/A
diagnostics: Application application_1459907402325_0004 failed 2 times
due to AM Container for appattempt_1459907402325_0004_000002 exited with
exitCode: -1000
For more detailed output, check application tracking
page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click
on links to logs of each attempt.
Diagnostics: Resource
hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
changed on src filesystem (expected 1459909780508, was 1459909782590)
java.io.IOException: Resource
hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
changed on src filesystem (expected 1459909780508, was 1459909782590)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
This patch checks the file name before uploading, so that a file whose name is
already in the distributed cache is not uploaded again.
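The de-duplication idea can be sketched roughly as follows. This is an
illustrative Python model, not the actual Scala change in the PR; the
`add_to_distributed_cache` helper and the file paths are hypothetical, and the
point is only that uploads are keyed by base file name:

```python
import os

def add_to_distributed_cache(cached_names, uri):
    """Upload `uri` only if no file with the same base name is cached yet.

    cached_names: set of file names already uploaded (e.g. the assembly jars).
    Returns True if the file was (notionally) uploaded, False if it was
    skipped as a same-name duplicate.
    """
    name = os.path.basename(uri)
    if name in cached_names:
        # A same-name file was already uploaded to .sparkStaging; uploading
        # again would change its timestamp and trigger YARN's
        # "Resource ... changed on src filesystem" error during localization.
        return False
    cached_names.add(name)
    return True

# Jars under assembly/target/scala-xxx/jars are uploaded first ...
cached = set()
add_to_distributed_cache(cached, "hdfs:///spark/jars/avro-mapred-1.7.7-hadoop2.jar")

# ... so a --jars entry with the same file name is then skipped.
print(add_to_distributed_cache(cached, "/home/user/libs/avro-mapred-1.7.7-hadoop2.jar"))
```

Printing the second call's result shows `False`: the local jar is skipped
because a jar of the same name is already staged.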
## How was this patch tested?
Unit tests and a manual integration test were done locally.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jerryshao/apache-spark SPARK-14423
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12203.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12203
----
commit 7ff58be3ab8ae2d1768a53245bbe0be01014a211
Author: jerryshao <[email protected]>
Date: 2016-04-06T09:48:52Z
avoid same name files added to distributed cache again
----