[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267398#comment-17267398 ]
Steve Loughran commented on SPARK-33605:
----------------------------------------

The hadoop-aws and AWS SDK JARs are included if you build with -Pspark-hadoop-cloud; at that point the entire distro gets the full shaded SDK. Include -Pkinesis to get the Spark Kinesis binding too, which uses the same AWS SDK (see the build-command sketch at the end of this message).

To support GCS, adding it as another dependency of the spark-hadoop-cloud module would be the right strategy, as then it goes in only when desired.

FWIW, Cloudera's products declare GCS as a dependency of the hadoop-cloud-storage POM, so Spark gets it (and some other things) without needing any POM changes:

{code}
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <classifier>shaded</classifier>
</dependency>
{code}

But: it adds a loop in the build which complicates life, especially if someone makes an incompatible change between hadoop-common and the GCS connector. Better for Spark to pull it in directly.

The full declaration of the GCS dependency there refers to the shaded artifact and then evicts all the dependencies the shaded POM still declares it needs. They are *not* needed, just vestigial dependencies left in to complicate builds.

{code}
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <classifier>shaded</classifier>
  <version>${gcs.version}</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client-java6</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.api-client</groupId>
      <artifactId>google-api-client-jackson2</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-storage</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.oauth-client</groupId>
      <artifactId>google-oauth-client</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.oauth-client</groupId>
      <artifactId>google-oauth-client-java6</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcsio</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>util-hadoop</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.code.findbugs</groupId>
      <artifactId>jsr305</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.flogger</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}

> Add GCS FS/connector config (dependencies?) akin to S3
> ------------------------------------------------------
>
>                 Key: SPARK-33605
>                 URL: https://issues.apache.org/jira/browse/SPARK-33605
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use
> with S3; for GCS to work, users are required to manually configure the JARs.
> This is especially problematic for Python users who may not be accustomed to
> Java dependencies, etc. This is an example of a workaround for PySpark:
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the
> [GCS connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
> it would make things easier for GCS users.
> Please let me know what you think.
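For reference, a minimal sketch of the build invocation described above, under stated assumptions: the comment names the profiles -Pspark-hadoop-cloud and -Pkinesis, while recent Spark releases spell them -Phadoop-cloud and -Pkinesis-asl, so verify the profile ids against the pom.xml of your checkout; the --name value is arbitrary.

{code}
# Sketch: build a Spark distribution with the cloud-storage module
# (hadoop-aws + shaded AWS SDK) and the Kinesis binding enabled.
# Profile ids below are assumptions -- check your Spark version's POM.
./dev/make-distribution.sh --name cloud --tgz \
  -Phadoop-cloud -Pkinesis-asl
{code}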