Thanks for clarifying. I will investigate your solution for using S3FileIO later.

On Tue, Aug 17, 2021 at 11:40 AM Jack Ye <yezhao...@gmail.com> wrote:

Good to hear the issue is fixed!

ACL is optional; as the javadoc says, "If not set, ACL will not be set for requests".

But I think to use MinIO you need to use a custom client factory to set your S3 endpoint to that MinIO endpoint.
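(A rough, untested sketch of such a factory, assuming Iceberg 0.11's org.apache.iceberg.aws.AwsClientFactory interface with s3(), glue() and kms() methods; the class name and the "s3.endpoint" property key below are made up for illustration:)

    import java.net.URI
    import java.util.{Map => JMap}

    import org.apache.iceberg.aws.AwsClientFactory
    import software.amazon.awssdk.regions.Region
    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.kms.KmsClient
    import software.amazon.awssdk.services.s3.{S3Client, S3Configuration}

    // Points Iceberg's S3 client at a MinIO endpoint; credentials still come
    // from the default chain (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY etc.).
    class MinioClientFactory extends AwsClientFactory {
      private var endpoint: String = _

      override def initialize(properties: JMap[String, String]): Unit = {
        // "s3.endpoint" is an arbitrary key chosen for this sketch
        endpoint = properties.getOrDefault("s3.endpoint", "http://localhost:9000")
      }

      override def s3(): S3Client =
        S3Client.builder()
          .endpointOverride(URI.create(endpoint))
          .region(Region.US_EAST_1)
          // MinIO is normally addressed path-style, not virtual-host style
          .serviceConfiguration(
            S3Configuration.builder().pathStyleAccessEnabled(true).build())
          .build()

      override def glue(): GlueClient = GlueClient.builder().build()
      override def kms(): KmsClient = KmsClient.builder().build()
    }

(Compiled into a jar on the catalog classpath, it would be enabled with something like --conf spark.sql.catalog.hive_test.client.factory=MinioClientFactory; see the AWS client customization page linked later in this thread for the exact property name.)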
-Jack

On Tue, Aug 17, 2021 at 11:36 AM Lian Jiang <jiangok2...@gmail.com> wrote:

Hi Ryan,

S3FileIO needs a canned ACL according to:

    /**
     * Used to configure canned access control list (ACL) for S3 client to
     * use during write.
     * If not set, ACL will not be set for requests.
     * <p>
     * The input must be one of {@link software.amazon.awssdk.services.s3.model.ObjectCannedACL},
     * such as 'public-read-write'
     * For more details: https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
     */
    public static final String S3FILEIO_ACL = "s3.acl";

MinIO does not support canned ACLs according to https://docs.min.io/docs/minio-server-limits-per-tenant.html:

List of Amazon S3 Bucket APIs not supported on MinIO

- BucketACL (use bucket policies <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
- BucketCORS (CORS enabled by default on all buckets for all HTTP verbs)
- BucketWebsite (use caddy <https://github.com/caddyserver/caddy> or nginx <https://www.nginx.com/resources/wiki/>)
- BucketAnalytics, BucketMetrics, BucketLogging (use bucket notification <https://docs.min.io/docs/minio-client-complete-guide#events> APIs)
- BucketRequestPayment

List of Amazon S3 Object APIs not supported on MinIO

- ObjectACL (use bucket policies <https://docs.min.io/docs/minio-client-complete-guide#policy> instead)
- ObjectTorrent

Hope this makes sense.

BTW, Iceberg + Hive + S3A works now that the Hive S3A issue has been fixed. Thanks Jack for helping debug.

On Tue, Aug 17, 2021 at 8:38 AM Ryan Blue <b...@tabular.io> wrote:

I'm not sure that I'm following why MinIO won't work with S3FileIO. S3FileIO assumes that the credentials are handled by a credentials provider outside of S3FileIO. How does MinIO handle credentials?

Ryan

On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <yezhao...@gmail.com> wrote:

Talked with Lian on Slack; the user is using a hadoop 3.2.1 + hive (postgres) + spark + minio docker installation. There might be some S3A related dependencies missing on the Hive server side based on the stack trace. Let's see if that fixes the issue.
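(If that is indeed the problem, a typical fix is to drop the matching jars onto the Hive server and restart it; a sketch with hypothetical paths, where hadoop-aws must match the server's Hadoop 3.2.1 and the aws-java-sdk-bundle version must match what that hadoop-aws expects:)

    # hypothetical example for a Hadoop 3.2.x based Hive image
    cp hadoop-aws-3.2.1.jar aws-java-sdk-bundle-1.11.375.jar $HIVE_HOME/lib/
    # restart the metastore so org.apache.hadoop.fs.s3a.S3AFileSystem can load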
-Jack

On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <jiangok2...@gmail.com> wrote:

This is my full script launching spark-shell:

    # add Iceberg dependency
    export AWS_REGION=us-east-1
    export AWS_ACCESS_KEY_ID=minio
    export AWS_SECRET_ACCESS_KEY=minio123

    ICEBERG_VERSION=0.11.1
    DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"

    MINIOSERVER=192.168.176.5

    # add AWS dependency
    AWS_SDK_VERSION=2.15.40
    AWS_MAVEN_GROUP=software.amazon.awssdk
    AWS_PACKAGES=(
        "bundle"
        "url-connection-client"
    )
    for pkg in "${AWS_PACKAGES[@]}"; do
        DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
    done

    # start Spark SQL client shell
    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
        --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
        --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
        --conf spark.hadoop.fs.s3a.access.key=minio \
        --conf spark.hadoop.fs.s3a.secret.key=minio123 \
        --conf spark.hadoop.fs.s3a.path.style.access=true \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Let me know if anything is missing. Thanks.

On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:

Have you included the hadoop-aws jar? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws

-Jack

On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Jack,

You are right. S3FileIO will not work on MinIO since MinIO does not support ACLs: https://docs.min.io/docs/minio-server-limits-per-tenant.html

To use Iceberg and MinIO with S3A, I used the script below to launch spark-shell (note the io-impl line now set to HadoopFileIO):

    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
        --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
        --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
        --conf spark.hadoop.fs.s3a.access.key=minio \
        --conf spark.hadoop.fs.s3a.secret.key=minio123 \
        --conf spark.hadoop.fs.s3a.path.style.access=true \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

The spark code:

    import org.apache.spark.sql.SparkSession
    val values = List(1,2,3,4,5)

    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._
    val df = values.toDF()

    val core = "mytable"
    val table = s"hive_test.mydb.${core}"
    val s3IcePath = s"s3a://east/${core}.ice"

    df.writeTo(table)
      .tableProperty("write.format.default", "parquet")
      .tableProperty("location", s3IcePath)
      .createOrReplace()

Still the same error:

    java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

What else could be wrong? Thanks for any clue.
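(A quick probe in spark-shell can tell whether the class is visible to the Spark driver at all; if this call succeeds, the ClassNotFoundException is more likely coming from the Hive server side:)

    // throws ClassNotFoundException if hadoop-aws is absent from this JVM
    Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")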
On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:

Sorry for the late reply, I thought I replied on Friday but the email did not send successfully.

As Daniel said, you don't need to set up S3A if you are using S3FileIO.

The S3FileIO by default reads the default credentials chain to check credential setups one by one: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain

If you would like to use a specialized credential provider, you can directly customize your S3 client: https://iceberg.apache.org/aws/#aws-client-customization

It looks like you are trying to use MinIO to mount an S3A file system? If you have to use MinIO then there is not a way to integrate with S3FileIO right now. (Maybe I am wrong on this; I don't know much about MinIO.)

To directly use S3FileIO with HiveCatalog, simply do:

    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
        --conf spark.sql.catalog.hive_test.warehouse=s3://bucket

Best,
Jack Ye
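(With the SDK bundle already on the classpath via --packages, what the default chain resolves can be checked directly in spark-shell; a small sketch, which here should print the access key exported as an env variable:)

    import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider

    // prints the access key id resolved by the default credentials chain
    val creds = DefaultCredentialsProvider.create().resolveCredentials()
    println(creds.accessKeyId())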
On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks. I prefer S3FileIO as it is recommended by Iceberg. Do you have a sample using hive catalog, S3FileIO, the Spark API (as opposed to SQL), and S3 access.key and secret.key? It is hard to get all the settings right for this combination without an example. Appreciate any help.

On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

So, if I recall correctly, the hive server does need access to check and create paths for table locations.

There may be an option to disable this behavior, but otherwise the fs implementation probably needs to be available to the hive metastore.

-Dan

On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Daniel.

After modifying the script to:

    export AWS_REGION=us-east-1
    export AWS_ACCESS_KEY_ID=minio
    export AWS_SECRET_ACCESS_KEY=minio123

    ICEBERG_VERSION=0.11.1
    DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"

    MINIOSERVER=192.168.160.5

    # add AWS dependency
    AWS_SDK_VERSION=2.15.40
    AWS_MAVEN_GROUP=software.amazon.awssdk
    AWS_PACKAGES=(
        "bundle"
        "url-connection-client"
    )
    for pkg in "${AWS_PACKAGES[@]}"; do
        DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
    done

    # start Spark SQL client shell
    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
        --conf spark.hadoop.fs.s3a.access.key=minio \
        --conf spark.hadoop.fs.s3a.secret.key=minio123 \
        --conf spark.hadoop.fs.s3a.path.style.access=true \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

I got: MetaException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not using s3 and should not cause this error. Any idea what dependency I could be missing? Thanks.
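(Besides the jars, the metastore would also need the s3a endpoint and keys in its own Hadoop configuration to reach MinIO; roughly, in the Hive server's core-site.xml, mirroring the Spark confs above with illustrative values:)

    <!-- hypothetical additions to the Hive server's core-site.xml -->
    <property><name>fs.s3a.endpoint</name><value>http://192.168.160.5:9000</value></property>
    <property><name>fs.s3a.access.key</name><value>minio</value></property>
    <property><name>fs.s3a.secret.key</name><value>minio123</value></property>
    <property><name>fs.s3a.path.style.access</name><value>true</value></property>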
On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

Hey Lian,

At a cursory glance, it appears that you might be mixing two different FileIO implementations, which may be why you are not getting the expected result.

When you set --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO, you're actually switching over to the native S3 implementation within Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of the following settings to set up access are then set for the S3AFileSystem (which would not be used with S3FileIO).

You might try just removing that line, since it should use the HadoopFileIO at that point and may work.

Hope that's helpful,
-Dan

On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Hi,

I am trying to create an iceberg table on MinIO s3 and hive.

This is how I launch spark-shell:

    # add Iceberg dependency
    export AWS_REGION=us-east-1
    export AWS_ACCESS_KEY_ID=minio
    export AWS_SECRET_ACCESS_KEY=minio123

    ICEBERG_VERSION=0.11.1
    DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"

    MINIOSERVER=192.168.160.5

    # add AWS dependency
    AWS_SDK_VERSION=2.15.40
    AWS_MAVEN_GROUP=software.amazon.awssdk
    AWS_PACKAGES=(
        "bundle"
        "url-connection-client"
    )
    for pkg in "${AWS_PACKAGES[@]}"; do
        DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
    done

    # start Spark SQL client shell
    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
        --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
        --conf spark.hadoop.fs.s3a.access.key=minio \
        --conf spark.hadoop.fs.s3a.secret.key=minio123 \
        --conf spark.hadoop.fs.s3a.path.style.access=true \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Here is the spark code to create the iceberg table:

    import org.apache.spark.sql.SparkSession
    val values = List(1,2,3,4,5)

    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._
    val df = values.toDF()

    val core = "mytable8"
    val table = s"hive_test.mydb.${core}"
    val s3IcePath = s"s3a://spark-test/${core}.ice"

    df.writeTo(table)
      .tableProperty("write.format.default", "parquet")
      .tableProperty("location", s3IcePath)
      .createOrReplace()

I got an error: "The AWS Access Key Id you provided does not exist in our records."

I have verified that I can log in to the MinIO UI using the same username and password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables.
https://github.com/apache/iceberg/issues/2168 is related but does not help me. I am not sure why the credentials do not work for Iceberg + AWS. Any idea, or an example of writing an iceberg table to S3 using a hive catalog, would be highly appreciated! Thanks.
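(The wording of that error suggests the request reached AWS itself rather than MinIO, which fits Daniel's point above: S3FileIO was never pointed at the MinIO endpoint, so the MinIO keys were presented to real S3. The keys themselves can be checked against MinIO directly, e.g. with the AWS CLI if it is available in the container, using the bucket and endpoint from the script above:)

    aws --endpoint-url http://$MINIOSERVER:9000 s3 ls s3://east/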