chulucninh09 opened a new issue #3546:
URL: https://github.com/apache/iceberg/issues/3546
Hi, I'm using AWS S3 with the Hadoop catalog for Spark, without GlueCatalog.
Then I received this error:
`org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"`
After debugging for a while, I noticed that HadoopCatalog is implemented in a way that raises an unknown "s3" scheme exception. The problem does not happen if I use `catalog-impl=GlueCatalog`.
1. In my case I only want to use S3 with Iceberg, not GlueCatalog. How can I do that? (See the workaround sketch after the log below.)
2. Can we make the docs clearer about the dependencies and configs needed for AWS? Happy to make a PR for that.
3. If it is a problem with the implementation, can someone make a PR?
Here is my code: Spark 3.1.2 with Hadoop 3.2.0, and no `core-site.xml` file.
```
import os
from pyspark.context import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext

AWS_SDK_VERSION = "2.15.40"
AWS_MAVEN_GROUP = "software.amazon.awssdk"
AWS_PACKAGES = [
    "bundle",
    "url-connection-client",
]
ICEBERG_VERSION = "0.12.1"

# Build the --packages list: the Iceberg Spark 3 runtime plus the AWS SDK v2 artifacts
DEPENDENCIES = f"org.apache.iceberg:iceberg-spark3-runtime:{ICEBERG_VERSION}"
for package in AWS_PACKAGES:
    DEPENDENCIES += f",{AWS_MAVEN_GROUP}:{package}:{AWS_SDK_VERSION}"
os.environ['PYSPARK_SUBMIT_ARGS'] = f'--packages {DEPENDENCIES} pyspark-shell'

# Hadoop catalog with an s3:// warehouse and S3FileIO
scConf = (
    SparkConf()
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
    .set('spark.sql.catalog.my_catalog.type', 'hadoop')
    .set('spark.sql.catalog.my_catalog.warehouse', 's3://test-iceberg/test-iceberg-dremio4')
    .set('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)

sc = SparkContext(conf=scConf).getOrCreate()
sc.setLogLevel("DEBUG")
sql = SQLContext(sc)
sql.sql('create table my_catalog.common.table (id bigint, data string) using iceberg')
```
And here is the log
```
Py4JJavaError: An error occurred while calling o35.sql.
: org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: s3://test-iceberg/test-iceberg-dremio4
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:53)
    at org.apache.iceberg.hadoop.HadoopCatalog.initialize(HadoopCatalog.java:103)
    at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:193)
    at org.apache.iceberg.CatalogUtil.buildIcebergCatalog(CatalogUtil.java:225)
    at org.apache.iceberg.spark.SparkCatalog.buildIcebergCatalog(SparkCatalog.java:105)
    at org.apache.iceberg.spark.SparkCatalog.initialize(SparkCatalog.java:388)
    at org.apache.iceberg.spark.SparkSessionCatalog.buildSparkCatalog(SparkSessionCatalog.java:70)
    at org.apache.iceberg.spark.SparkSessionCatalog.initialize(SparkSessionCatalog.java:246)
    at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:61)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:51)
    ... 70 more
```
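Given that stack trace, for question 1 a possible workaround (just a sketch, I have not verified it) might be to also add `hadoop-aws` so that Hadoop itself can resolve the warehouse path, either by switching the warehouse to the `s3a://` scheme or by mapping the bare `s3` scheme onto `S3AFileSystem`, while keeping `S3FileIO` for the data and metadata files:
```
# Workaround sketch (assumption, untested): add hadoop-aws matching the Hadoop version so the
# Hadoop catalog can resolve the warehouse path; S3FileIO still handles data/metadata files.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:3.2.0"  # provides org.apache.hadoop.fs.s3a.S3AFileSystem
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    f'--packages {HADOOP_AWS},{DEPENDENCIES} pyspark-shell'  # DEPENDENCIES as built above
)

scConf = (
    SparkConf()
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
    .set('spark.sql.catalog.my_catalog.type', 'hadoop')
    # either use the s3a:// scheme that hadoop-aws registers a FileSystem for...
    .set('spark.sql.catalog.my_catalog.warehouse', 's3a://test-iceberg/test-iceberg-dremio4')
    # ...or keep s3:// and map the bare scheme onto S3AFileSystem:
    .set('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    .set('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)
```
If there is a supported way to use only `S3FileIO` with the Hadoop catalog, without pulling in any Hadoop S3 filesystem at all, that is really what question 1 is asking for.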
GlueCatalog is implemented differently from HadoopCatalog; there is no scheme detection in GlueCatalog:
https://github.com/apache/iceberg/blob/b6ce66112bea752513d5317c0813c77ea980643a/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java#L119-L124
HadoopCatalog calls `getFs`, which resolves the scheme and throws this error:
https://github.com/apache/iceberg/blob/b6ce66112bea752513d5317c0813c77ea980643a/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java#L102
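To illustrate the difference, the failing step can be hit in isolation from PySpark (a quick sketch, assuming the same SparkContext `sc` as above): it calls `Path.getFileSystem`, which is exactly what `Util.getFs` does in the trace, and which GlueCatalog never does.
```
# Minimal reproduction sketch of HadoopCatalog's failing step:
# HadoopCatalog.initialize -> Util.getFs -> Path.getFileSystem(conf).
# Assumes the SparkContext `sc` created above is still active.
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
warehouse = jvm.org.apache.hadoop.fs.Path("s3://test-iceberg/test-iceberg-dremio4")
try:
    warehouse.getFileSystem(hadoop_conf)
except Exception as e:
    # UnsupportedFileSystemException: No FileSystem for scheme "s3" -- nothing is
    # registered for the bare "s3" scheme in the Hadoop configuration.
    print(e)
```
GlueCatalog, by contrast, just builds the configured FileIO (here `S3FileIO`) and never asks Hadoop for a FileSystem, which is why the same `s3://` warehouse URI works there.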