vanzin commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a Hadoop Compatible File System
URL: https://github.com/apache/spark/pull/23546#discussion_r279107963
##########
File path: docs/running-on-kubernetes.md
##########

@@ -208,8 +208,30 @@ If your application's dependencies are all hosted in remote locations like HDFS
 by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images.
 Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. The `local://` scheme is also required when referring to
-dependencies in custom-built Docker images in `spark-submit`. Note that using application dependencies from the submission
-client's local file system is currently not yet supported.
+dependencies in custom-built Docker images in `spark-submit`. We support dependencies from the submission
+client's local file system using the `file://` scheme or without a scheme (using a full path), where the destination should be
+a Hadoop-compatible file system.
+A typical example of this, using S3, is to pass the following options:
+
+```
+...
+--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
+--conf spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path
+--conf spark.hadoop.fs.s3a.access.key=...
+--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+--conf spark.hadoop.fs.s3a.fast.upload=true
+--conf spark.hadoop.fs.s3a.secret.key=....
+--conf spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp
+file:///full/path/to/app.jar
+```
+The app jar file will be uploaded to S3 and then, when the driver is launched, it will be downloaded
+to the driver pod and added to its classpath.
+
+The client scheme is supported for the application jar and for dependencies specified by the properties `spark.jars` and `spark.files`.
+
+Important: all client-side dependencies will be uploaded to the given path with a flat directory structure so

Review comment:
   One thing that I'm missing after reading this again is how cleanup is handled. What about multiple applications being started concurrently? It seems like you'll end up with a directory full of jars from different applications, and if two of them decide to upload "app.jar", one of them might fail because it's using the wrong jar or a partially written one (e.g. because the second submission is still in progress). Wouldn't it be better to use a submission-specific path that Spark creates, and have the driver make a best effort to clean it up after it downloads the dependencies? Similar to what Spark does on YARN.
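   To make the suggestion concrete, here is a minimal sketch (not the PR's actual code) of what a submission-specific upload directory with best-effort driver-side cleanup could look like, using the Hadoop `FileSystem` API. The object name `SubmissionUploadSketch`, its method names, and the `spark-upload-<uuid>` directory naming are hypothetical, introduced only for illustration.

```scala
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical sketch of the reviewer's suggestion: each submission uploads
// into its own unique subdirectory under the configured upload root, and the
// driver deletes that directory (best effort) once dependencies are localized.
object SubmissionUploadSketch {

  /** Upload a local file into a freshly created, submission-specific directory. */
  def uploadForSubmission(localFile: String, uploadRoot: String, hadoopConf: Configuration): Path = {
    // e.g. s3a://<s3-bucket>/path/spark-upload-<random-uuid>/app.jar
    val submissionDir = new Path(uploadRoot, s"spark-upload-${UUID.randomUUID()}")
    val fs = submissionDir.getFileSystem(hadoopConf)
    fs.mkdirs(submissionDir)
    fs.copyFromLocalFile(new Path(localFile), submissionDir)
    submissionDir
  }

  /** Best-effort cleanup the driver could run after downloading the dependencies. */
  def cleanupSubmissionDir(submissionDir: Path, hadoopConf: Configuration): Unit = {
    try {
      submissionDir.getFileSystem(hadoopConf).delete(submissionDir, true)
    } catch {
      case e: Exception =>
        // Cleanup is best effort: log and continue rather than failing the application.
        System.err.println(s"Failed to delete $submissionDir: ${e.getMessage}")
    }
  }
}
```

   With a layout like this, concurrent submissions that each upload an `app.jar` write into separate directories, so neither can observe the other's partially written file, and a leftover directory only remains if the driver never starts or the delete fails.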
