ocworld opened a new pull request, #38828:
URL: https://github.com/apache/spark/pull/38828

   ### What changes were proposed in this pull request?
   Support `--packages` in K8s cluster mode.
   
   ### Why are the changes needed?
   In Spark 3, `--packages` is not supported in K8s cluster mode. I expected to be able to manage dependencies with `--packages`, as in Spark 2.
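   
   For example, the expectation is that a submission like the following (sketched with the `SparkLauncher` API; the master URL, container image, main class, and app jar below are hypothetical placeholders) would resolve and ship the package jars:
   
   ```scala
   import org.apache.spark.launcher.SparkLauncher
   
   object SubmitWithPackages {
     def main(args: Array[String]): Unit = {
       val app = new SparkLauncher()
         .setMaster("k8s://https://example-cluster:6443")             // hypothetical API server
         .setDeployMode("cluster")
         .setConf("spark.kubernetes.container.image", "spark:latest") // hypothetical image
         .setMainClass("org.example.Main")                            // hypothetical main class
         .setAppResource("local:///opt/app/app.jar")                  // hypothetical app jar
         // The option this PR is about: resolve Maven coordinates at launch.
         .addSparkArg("--packages",
           "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6")
         .launch()
       app.waitFor()
     }
   }
   ```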
   
   Spark 2.4.5
   
   
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
   
   ```scala
   if (!isMesosCluster && !isStandAloneCluster) {
     // Resolve maven dependencies if there are any and add classpath to jars. Add them to py-files
     // too for packages that include Python code
     val resolvedMavenCoordinates = DependencyUtils.resolveMavenDependencies(
       args.packagesExclusions, args.packages, args.repositories, args.ivyRepoPath,
       args.ivySettingsPath)

     if (!StringUtils.isBlank(resolvedMavenCoordinates)) {
       args.jars = mergeFileLists(args.jars, resolvedMavenCoordinates)
       if (args.isPython || isInternal(args.primaryResource)) {
         args.pyFiles = mergeFileLists(args.pyFiles, resolvedMavenCoordinates)
       }
     }

     // install any R packages that may have been passed through --jars or --packages.
     // Spark Packages may contain R source code inside the jar.
     if (args.isR && !StringUtils.isBlank(args.jars)) {
       RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
     }
   }
   ```
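   
   For context, `mergeFileLists` here just joins comma-separated file lists; a rough standalone re-implementation (illustrative only, not the actual private helper in `SparkSubmit`):
   
   ```scala
   // Join comma-separated file lists, skipping null/blank inputs; return
   // null when nothing remains, mirroring how SparkSubmit treats "no jars".
   def mergeFileLists(lists: String*): String = {
     val merged = lists
       .filter(s => s != null && s.trim.nonEmpty)
       .flatMap(_.split(","))
       .mkString(",")
     if (merged.isEmpty) null else merged
   }
   
   // mergeFileLists("a.jar,b.jar", null, "file:///tmp/c.jar")
   //   == "a.jar,b.jar,file:///tmp/c.jar"
   ```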
   
   Spark 3.0.2
   
   
https://github.com/apache/spark/blob/v3.0.2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
   
   ```scala
   if (!StringUtils.isBlank(resolvedMavenCoordinates)) {
     // In K8s client mode, when in the driver, add resolved jars early as we might need
     // them at the submit time for artifact downloading.
     // For example we might use the dependencies for downloading
     // files from a Hadoop Compatible fs eg. S3. In this case the user might pass:
     // --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
     if (isKubernetesClusterModeDriver) {
       val loader = getSubmitClassLoader(sparkConf)
       for (jar <- resolvedMavenCoordinates.split(",")) {
         addJarToClasspath(jar, loader)
       }
     } else if (isKubernetesCluster) {
       // We need this in K8s cluster mode so that we can upload local deps
       // via the k8s application, like in cluster mode driver
       childClasspath ++= resolvedMavenCoordinates.split(",")
     } else {
       args.jars = mergeFileLists(args.jars, resolvedMavenCoordinates)
       if (args.isPython || isInternal(args.primaryResource)) {
         args.pyFiles = mergeFileLists(args.pyFiles, resolvedMavenCoordinates)
       }
     }
   }
   ```
   
   Unlike in Spark 2, in Spark 3 the resolved jars are never merged into `args.jars` (or `args.pyFiles`) in K8s cluster mode; they only reach the submit-time classpath.
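   
   A toy model of that gap (the names here are illustrative, not Spark internals): the resolved jars reach only the spark-submit JVM's classpath, never the list that becomes `spark.jars`:
   
   ```scala
   object PackagesGapDemo {
     def main(args: Array[String]): Unit = {
       val resolved = Seq(
         "file:///ivy/aws-java-sdk-1.7.4.jar",
         "file:///ivy/hadoop-aws-2.7.6.jar")
   
       var childClasspath = Vector.empty[String] // classpath of the submit JVM
       var appJars        = Vector.empty[String] // what becomes spark.jars
   
       // Spark 3, isKubernetesCluster branch: submit classpath only.
       childClasspath ++= resolved
   
       // Spark 2 would additionally have merged them into the app's jars:
       // appJars ++= resolved
   
       println(s"submit classpath: $childClasspath")
       println(s"spark.jars     : $appJars") // empty: driver/executors never see the deps
     }
   }
   ```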
   
   ### Does this PR introduce _any_ user-facing change?
   Unlike in Spark 2, the resolved jars are now added in the driver rather than at spark-submit time in cluster mode.
   
   This is because Spark 3 added a feature that uploads jars with the `file://` prefix to a remote store such as S3. If the resolved jars were merged in at spark-submit time, every jar pulled in by `--packages` would be uploaded, which was a very poor experience when I tested it.
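   
   To make the cost concrete, a small illustration (paths are made up): packages resolve into the local ivy cache as `file://` URIs, and in K8s cluster mode spark-submit uploads local `file://` jars to `spark.kubernetes.file.upload.path` before the driver pod starts, so merging them in at submit time would re-upload the whole dependency tree on every submission:
   
   ```scala
   object UploadCostDemo {
     def main(args: Array[String]): Unit = {
       // What dependency resolution might return (hypothetical paths).
       val resolvedMavenCoordinates =
         "file:///home/user/.ivy2/jars/aws-java-sdk-1.7.4.jar," +
         "file:///home/user/.ivy2/jars/hadoop-aws-2.7.6.jar"
   
       // Local file:// URIs are the ones the K8s submit step would upload.
       val toUpload = resolvedMavenCoordinates.split(",").filter(_.startsWith("file://"))
       println(s"${toUpload.length} package jars (plus transitive deps) uploaded per submit")
     }
   }
   ```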
   
   ### How was this patch tested?
   Tested manually in my K8s environment.

