chulucninh09 opened a new issue #3546:
URL: https://github.com/apache/iceberg/issues/3546
Hi, I'm using AWS S3 with the Hadoop catalog for Spark, without GlueCatalog.
Then I received this error:
`org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"`
After debugging for a while, I noticed that HadoopCatalog is implemented in a way that raises an unknown "s3" scheme exception. The problem does not happen if I use `catalog-impl=GlueCatalog`.
1. In my case I only want to use S3 with Iceberg, not GlueCatalog. How can I do that? (See the workaround sketch after the log below.)
2. Can we make the docs clearer about the dependencies and configs needed for AWS? Happy to make a PR for that.
3. If it is a problem with the implementation, can someone make a PR?
Here is my code: Spark 3.1.2 with Hadoop 3.2.0, and no `core-site.xml` file.
```
import os
from pyspark.context import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext

AWS_SDK_VERSION = "2.15.40"
AWS_MAVEN_GROUP = "software.amazon.awssdk"
AWS_PACKAGES = [
    "bundle",
    "url-connection-client",
]
ICEBERG_VERSION = "0.12.1"

# Build the --packages list: the Iceberg Spark 3 runtime plus the AWS SDK v2 artifacts
DEPENDENCIES = f"org.apache.iceberg:iceberg-spark3-runtime:{ICEBERG_VERSION}"
for package in AWS_PACKAGES:
    DEPENDENCIES += f",{AWS_MAVEN_GROUP}:{package}:{AWS_SDK_VERSION}"
os.environ['PYSPARK_SUBMIT_ARGS'] = f'--packages {DEPENDENCIES} pyspark-shell'

# Hadoop catalog with an s3:// warehouse and S3FileIO
scConf = (
    SparkConf()
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
    .set('spark.sql.catalog.my_catalog.type', 'hadoop')
    .set('spark.sql.catalog.my_catalog.warehouse', 's3://test-iceberg/test-iceberg-dremio4')
    .set('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)

sc = SparkContext(conf=scConf).getOrCreate()
sc.setLogLevel("DEBUG")
sql = SQLContext(sc)
sql.sql('create table my_catalog.common.table (id bigint, data string) using iceberg')
```
And here is the log
```
Py4JJavaError: An error occurred while calling o35.sql.
: org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: s3://test-iceberg/test-iceberg-dremio4
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:53)
    at org.apache.iceberg.hadoop.HadoopCatalog.initialize(HadoopCatalog.java:103)
    at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:193)
    at org.apache.iceberg.CatalogUtil.buildIcebergCatalog(CatalogUtil.java:225)
    at org.apache.iceberg.spark.SparkCatalog.buildIcebergCatalog(SparkCatalog.java:105)
    at org.apache.iceberg.spark.SparkCatalog.initialize(SparkCatalog.java:388)
    at org.apache.iceberg.spark.SparkSessionCatalog.buildSparkCatalog(SparkSessionCatalog.java:70)
    at org.apache.iceberg.spark.SparkSessionCatalog.initialize(SparkSessionCatalog.java:246)
    at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:61)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:51)
    ... 70 more
```
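Given that stack trace, for question 1 a possible workaround (just a sketch, I have not verified it) might be to also add `hadoop-aws` so that Hadoop itself can resolve the warehouse path, either by switching the warehouse to the `s3a://` scheme or by mapping the bare `s3` scheme onto `S3AFileSystem`, while keeping `S3FileIO` for the data and metadata files:
```
# Workaround sketch (assumption, untested): add hadoop-aws matching the Hadoop version so the
# Hadoop catalog can resolve the warehouse path; S3FileIO still handles data/metadata files.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:3.2.0"  # provides org.apache.hadoop.fs.s3a.S3AFileSystem
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    f'--packages {HADOOP_AWS},{DEPENDENCIES} pyspark-shell'  # DEPENDENCIES as built above
)

scConf = (
    SparkConf()
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
    .set('spark.sql.catalog.my_catalog.type', 'hadoop')
    # either use the s3a:// scheme that hadoop-aws registers a FileSystem for...
    .set('spark.sql.catalog.my_catalog.warehouse', 's3a://test-iceberg/test-iceberg-dremio4')
    # ...or keep s3:// and map the bare scheme onto S3AFileSystem:
    .set('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    .set('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)
```
If there is a supported way to use only `S3FileIO` with the Hadoop catalog, without pulling in any Hadoop S3 filesystem at all, that is really what question 1 is asking for.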
GlueCatalog is implemented differently from HadoopCatalog; there is no scheme detection in GlueCatalog:
https://github.com/apache/iceberg/blob/b6ce66112bea752513d5317c0813c77ea980643a/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java#L119-L124
HadoopCatalog calls `getFs`, which resolves the scheme and throws this error:
https://github.com/apache/iceberg/blob/b6ce66112bea752513d5317c0813c77ea980643a/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java#L102
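To illustrate the difference, the failing step can be hit in isolation from PySpark (a quick sketch, assuming the same SparkContext `sc` as above): it calls `Path.getFileSystem`, which is exactly what `Util.getFs` does in the trace, and which GlueCatalog never does.
```
# Minimal reproduction sketch of HadoopCatalog's failing step:
# HadoopCatalog.initialize -> Util.getFs -> Path.getFileSystem(conf).
# Assumes the SparkContext `sc` created above is still active.
jvm = sc._jvm
hadoop_conf = sc._jsc.hadoopConfiguration()
warehouse = jvm.org.apache.hadoop.fs.Path("s3://test-iceberg/test-iceberg-dremio4")
try:
    warehouse.getFileSystem(hadoop_conf)
except Exception as e:
    # UnsupportedFileSystemException: No FileSystem for scheme "s3" -- nothing is
    # registered for the bare "s3" scheme in the Hadoop configuration.
    print(e)
```
GlueCatalog, by contrast, just builds the configured FileIO (here `S3FileIO`) and never asks Hadoop for a FileSystem, which is why the same `s3://` warehouse URI works there.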