Oh sorry, I mistook --jars for --files. Yeah, for jars we need to add them to the classpath, which is different from regular files.

Cheng

On 6/11/15 2:18 PM, Dong Lei wrote:

Thanks Cheng,

If I do not use --jars, how can I tell Spark to find the jars (and files) on HDFS?

Do you mean the driver will not need to set up an HTTP file server in this scenario, and the workers will fetch the jars and files from HDFS?

Thanks

Dong Lei

*From:* Cheng Lian [mailto:lian.cs....@gmail.com]
*Sent:* Thursday, June 11, 2015 12:50 PM
*To:* Dong Lei; dev@spark.apache.org
*Cc:* Dianfei (Keith) Han
*Subject:* Re: How to support dependency jars and files on HDFS in standalone cluster mode?

Since the jars are already on HDFS, you can access them directly in your Spark application without using --jars
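
For example, something along these lines should work (just a sketch, reusing the hdfs://ip/1.jar path from your example; the executors would then fetch the jar through the Hadoop FileSystem API rather than the driver's HTTP file server):

    // Sketch: register a jar that already lives on HDFS with a running app
    sc.addJar("hdfs://ip/1.jar")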

Cheng

On 6/11/15 11:04 AM, Dong Lei wrote:

    Hi spark-dev:

    I cannot use an HDFS location for the “--jars” or “--files” option
    when doing a spark-submit in standalone cluster mode. For example:

                    spark-submit … --jars hdfs://ip/1.jar … hdfs://ip/app.jar   (standalone cluster mode)

    will not download 1.jar to the driver’s HTTP file server (though
    app.jar itself is downloaded to the driver’s working directory).

    I figured out why Spark does not download the jars: when
    sc.addJar adds them to the HTTP file server, the function called
    is Files.copy, which does not support remote locations.
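
    The gist of the failure, as I understand it (just a sketch; the
    real call is in HTTPFileServer.addFileToDir, and the paths here
    are made up for illustration):

        // Sketch: java.io.File can only name local paths, so an hdfs://
        // URI never resolves to an existing source file
        import java.io.File
        import com.google.common.io.Files

        val src = new File("hdfs://ip/1.jar") // treated as a local path
        Files.copy(src, new File("/tmp/http-dir/1.jar")) // fails: src does not exist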

    And even if Spark could download the jars and add them to the
    HTTP file server, the classpath would still not be set correctly,
    because it would contain the remote locations.

    So I’m trying to make this work and have come up with two
    options, but neither of them seems elegant, and I want to hear
    your advice:

    Option 1:

    Modify HTTPFileServer.addFileToDir so that it recognizes an
    “hdfs” prefix.
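
    Roughly, it might fetch remote files into the server’s local
    directory first (a sketch, assuming Hadoop’s FileSystem API is on
    the driver’s classpath; fetchToLocal is a hypothetical helper,
    not existing Spark code):

        import java.io.File
        import java.net.URI
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Sketch (hypothetical helper): copy a remote file into the HTTP
        // file server's directory so the existing local-copy logic still
        // applies afterwards
        def fetchToLocal(uri: String, dir: File): File = {
          val fs = FileSystem.get(new URI(uri), new Configuration())
          val local = new File(dir, new Path(uri).getName)
          fs.copyToLocalFile(new Path(uri), new Path(local.getAbsolutePath))
          local
        }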

    This is not good, because I think it breaks the scope of the HTTP
    file server.

    Option 2:

    Modify DriverRunner.downloadUserJar so that it downloads all the
    “--jars” and “--files” along with the application jar.

    This sounds more reasonable than Option 1 for downloading files.
    But this way I need to read “spark.jars” and “spark.files” in
    downloadUserJar or DriverRunner.start and replace them with local
    paths. How can I do that?
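
    What I have in mind is roughly this (a sketch; conf is the
    driver’s SparkConf, workDir its working directory, and
    fetchToLocal the hypothetical helper from Option 1):

        // Sketch: download each remote jar, then point spark.jars at the
        // local copies so the driver's classpath contains only local paths
        val remoteJars = conf.get("spark.jars", "").split(",").filter(_.nonEmpty)
        val localJars  = remoteJars.map(u => fetchToLocal(u, workDir).getAbsolutePath)
        conf.set("spark.jars", localJars.mkString(","))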

    Do you have a more elegant solution, or is there a plan to
    support this in the future?

    Thanks

    Dong Lei

