[
https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426701#comment-17426701
]
Colin Williams edited comment on SPARK-36936 at 10/9/21, 8:36 PM:
------------------------------------------------------------------
But the Spark 3.1.2 documentation
[https://spark.apache.org/docs/latest/cloud-integration.html] states:
<dependencyManagement>
...
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>hadoop-cloud_2.12</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
...
</dependencyManagement>
However, as the resolution log below shows, no such artifact exists for 3.1.2:
2021.10.09 13:34:47 INFO [error] (update)
sbt.librarymanagement.ResolveException: Error downloading
org.apache.spark:spark-hadoop-cloud_2.12:3.1.2
2021.10.09 13:34:47 INFO [error] Not found
2021.10.09 13:34:47 INFO [error] Not found
2021.10.09 13:34:47 INFO [error] not found:
/home/colin/.ivy2/local/org.apache.spark/spark-hadoop-cloud_2.12/3.1.2/ivys/ivy.xml
2021.10.09 13:34:47 INFO [error] not found:
https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2/spark-hadoop-cloud_2.12-3.1.2.pom
2021.10.09 13:34:47 INFO [error] not found:
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-hadoop-cloud_2.12/3.1.2/spark-hadoop-cloud_2.12-3.1.2.po
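For anyone wanting to reproduce the failed resolution above, a minimal sbt fragment along these lines should be enough. It is only a sketch: it assumes Scala 2.12, so that `%%` expands the artifact name to `spark-hadoop-cloud_2.12`, and it assumes no third-party resolver is configured. The coordinates are taken from the error log; nothing else about the project layout is implied.

```scala
// build.sbt (sketch): requests the coordinates shown in the error log
scalaVersion := "2.12.12"

// %% appends the Scala binary version, giving spark-hadoop-cloud_2.12
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.2" % "provided"
```

Running `sbt update` against this fragment is what I would expect to trigger the `ResolveException` shown above, for as long as the 3.1.2 artifact is absent from Maven Central.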
> spark-hadoop-cloud broken on release and only published via 3rd party
> repositories
> ----------------------------------------------------------------------------------
>
> Key: SPARK-36936
> URL: https://issues.apache.org/jira/browse/SPARK-36936
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.1.1, 3.1.2
> Environment: name := "spark-demo"
> version := "0.0.1"
> scalaVersion := "2.12.12"
> lazy val app = (project in file("app")).settings(
> assemblyPackageScala / assembleArtifact := false,
> assembly / assemblyJarName := "uber.jar",
> assembly / mainClass := Some("com.example.Main"),
> // more settings here ...
> )
> resolvers += "Cloudera" at
> "https://repository.cloudera.com/artifactory/cloudera-repos/"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" %
> "3.1.1.3.1.7270.0-253"
> libraryDependencies += "org.apache.hadoop" % "hadoop-aws" %
> "3.1.1.7.2.7.0-184"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
> libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
> // test suite settings
> fork in Test := true
> javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M",
> "-XX:+CMSClassUnloadingEnabled")
> // Show runtime of tests
> testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
> ___________________________________________________________________________________________
>
> import org.apache.spark.sql.SparkSession
> object SparkApp {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local")
>       // Optional: pull spark-hadoop-cloud at runtime from the Cloudera repository
>       //.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
>       //.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
>       .appName("spark session").getOrCreate()
>     // Both reads go through the s3a:// connector and fail as shown below
>     val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
>     val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
>     jsonDF.show()
>     csvDF.show()
>   }
> }
> Reporter: Colin Williams
> Priority: Major
>
> The Spark documentation suggests using `spark-hadoop-cloud` to read from and
> write to S3 in [https://spark.apache.org/docs/latest/cloud-integration.html].
> However, artifacts are currently published only via 3rd party resolvers, as
> listed in [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud],
> including Cloudera and Palantir.
>
> The Apache Spark documentation is therefore recommending a solution for
> object stores, including S3, that only 3rd parties provide. Furthermore, if
> you follow the instructions and include one of the 3rd party jars, i.e. the
> Cloudera jar, with the Spark 3.1.2 release and try to access an object store,
> the following exception is thrown.
>
> ```
> Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
> at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
> at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
> at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
> at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
> at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
> at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
> ```
> It looks like there are classpath conflicts when using the Cloudera-published
> `spark-hadoop-cloud` with Spark 3.1.2, again contradicting the documentation.
> The documented `spark-hadoop-cloud` approach to using object stores is thus
> poorly supported: it is available only from 3rd party repositories and not
> from the released Apache Spark whose documentation refers to it.
> Perhaps one day Apache Spark will provide tested software so that developers
> can quickly and easily access cloud object stores using the documentation.
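One way to narrow down the suspected classpath conflict is to ask the JVM which jar supplies `com.google.common.base.Preconditions` and whether it has the overload named in the `NoSuchMethodError`. The following is a rough sketch, not something taken from the issue: the object name `GuavaCheck` and its helper methods are hypothetical, and it only uses standard Java reflection. It would need to run on the same classpath as the failing application to be meaningful.

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark or Hadoop.
object GuavaCheck {
  // Returns the jar or directory a class was loaded from, if the JVM exposes it.
  def sourceOf(className: String): Option[String] =
    Try(Class.forName(className)).toOption.flatMap { cls =>
      Option(cls.getProtectionDomain)
        .flatMap(pd => Option(pd.getCodeSource))
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
    }

  // True if the class has checkArgument(boolean, String, Object, Object),
  // the overload the stack trace reports as missing.
  def hasOverload(className: String): Boolean =
    Try {
      Class.forName(className).getMethod(
        "checkArgument",
        classOf[Boolean], classOf[String], classOf[Object], classOf[Object])
    }.isSuccess

  def main(args: Array[String]): Unit = {
    val name = "com.google.common.base.Preconditions"
    println(s"loaded from: ${sourceOf(name).getOrElse("not found or unknown")}")
    println(s"has 4-argument checkArgument: ${hasOverload(name)}")
  }
}
```

If the overload is reported as absent, that would suggest the Guava on the classpath is older than the one the `hadoop-aws` jar was compiled against, which would be consistent with the exception above.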
--
This message was sent by Atlassian Jira
(v8.3.4#803005)