[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267398#comment-17267398 ]
Steve Loughran commented on SPARK-33605:
----------------------------------------

The hadoop-aws and AWS SDK JARs are included if you build with -Pspark-hadoop-cloud; at that point the entire distro gets the full shaded SDK. Include -Pkinesis to get the Spark Kinesis binding too, which uses the same AWS SDK (see the build-command sketch at the end of this message).

To support GCS, adding it as another dependency of the spark-hadoop-cloud module would be the right strategy, as then it goes in only when desired.

FWIW, Cloudera's products declare GCS as a dependency of the hadoop-cloud-storage POM, so Spark gets it (and some other things) without needing any POM changes:

{code}
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <classifier>shaded</classifier>
</dependency>
{code}

But: it adds a loop in the build which complicates life, especially if someone makes an incompatible change between hadoop-common and the GCS connector. Better for Spark to pull it in directly.

The full declaration of the GCS dependency there refers to the shaded artifact and then evicts all the dependencies the shaded POM still declares it needs. They are *not* needed, just vestigial dependencies left in to complicate builds.

{code}
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <classifier>shaded</classifier>
  <version>${gcs.version}</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client-java6</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client-jackson2</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-storage</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.oauth-client</groupId>
      <artifactId>google-oauth-client</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.oauth-client</groupId>
      <artifactId>google-oauth-client-java6</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcsio</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util-hadoop</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.code.findbugs</groupId>
      <artifactId>jsr305</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.flogger</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}

> Add GCS FS/connector config (dependencies?) akin to S3
> ------------------------------------------------------
>
>                 Key: SPARK-33605
>                 URL: https://issues.apache.org/jira/browse/SPARK-33605
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use
> with S3; for GCS to work, users are required to manually configure the JARs.
> This is especially problematic for Python users who may not be accustomed to
> Java dependencies, etc. This is an example of a workaround for PySpark:
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the
> [GCS connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
> it would make things easier for GCS users.
> Please let me know what you think.
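For reference, a minimal sketch of the build invocation described above, under stated assumptions: the comment names the profiles -Pspark-hadoop-cloud and -Pkinesis, while recent Spark releases spell them -Phadoop-cloud and -Pkinesis-asl, so verify the profile ids against the pom.xml of your checkout; the --name value is arbitrary.

{code}
# Sketch: build a Spark distribution with the cloud-storage module
# (hadoop-aws + shaded AWS SDK) and the Kinesis binding enabled.
# Profile ids below are assumptions -- check your Spark version's POM.
./dev/make-distribution.sh --name cloud --tgz \
  -Phadoop-cloud -Pkinesis-asl
{code}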