Ok, setting the hadoopConf before initialize seems to have gotten past that 
hump. I’m now debugging an issue with my tests: it appears that I’m not 
cleaning up Glue between test runs, but I am cleaning up the filesystem, so 
the metadata files no longer exist and the tests fail. But why is it using 
metadata files at all with the Glue catalog? The “metadata_location” in the 
Glue table parameters is set to a file path. Is that how this is supposed to 
work? Does it just remove the need for the version-hint.text file but not the 
metadata.json files?

Greg


From: Jack Ye <yezhao...@gmail.com>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Thursday, July 8, 2021 at 12:06 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: GlueCatalog example?

I think you need to first call setConf and then initialize, mimicking the logic 
in 
https://github.com/apache/iceberg/blob/6bcca16c48cd92dc98640130a28f73431e99e336/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L189-L191, 
which is used by all engines to initialize catalogs. You might be able to 
directly use CatalogUtil.buildIcebergCatalog instead of writing your own 
custom logic.
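As a rough sketch of that approach (the warehouse path and property values 
below are placeholders, not your real config, and this assumes the Iceberg 
core and AWS modules are on the classpath):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

public class GlueCatalogExample {

  public static Catalog load(Configuration hadoopConf) {
    // "catalog-impl" picks the catalog class; "warehouse" and "io-impl"
    // are placeholder values for illustration only.
    Map<String, String> props = Map.of(
        "catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog",
        "warehouse", "s3://my-bucket/warehouse",
        "io-impl", "org.apache.iceberg.hadoop.HadoopFileIO");

    // buildIcebergCatalog instantiates the catalog, calls setConf(hadoopConf)
    // on it if it is Hadoop-Configurable, and only then calls initialize(),
    // so the ordering problem is handled for you.
    return CatalogUtil.buildIcebergCatalog("iceberg", props, hadoopConf);
  }
}
```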

With that being said, I remember we had this conversation in another thread and 
did not continue with it: EMRFS consistent view is now unnecessary, since S3 is 
strongly consistent. I am not sure there is any additional benefit you would 
gain by continuing to use EMRFS.

-Jack Ye

On Thu, Jul 8, 2021 at 8:11 AM Greg Hill <gnh...@paypal.com.invalid> wrote:
Thanks! Seems I wasn’t too far off, then. It’s my understanding that because 
we’re using EMRFS consistent view, we should not use S3FileIO or the EMRFS 
metadata will get out of sync, but so far in my basic testing this catalog 
doesn’t seem to work with HadoopFileIO. I get a NullPointerException because 
the Hadoop configuration isn’t passed along at some point.

I noticed that I needed to call `setConf()` to get the Hadoop configs into the 
catalog object.

      Map<String, String> props = ImmutableMap.of(
        "type", "iceberg",
        "warehouse", config.getOutputDir(),
        "lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager",
        "lock.table", config.getDynamoIcebergLocksTable(),
        "io-impl", "org.apache.iceberg.hadoop.HadoopFileIO"
      );
      this.icebergCatalog.initialize("iceberg", props);

      this.icebergCatalog.setConf(spark.sparkContext().hadoopConfiguration());

Then when I call createTable later:

java.lang.NullPointerException
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:481)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.iceberg.hadoop.Util.getFs(Util.java:48)
        at org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
        at org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:64)
        at org.apache.iceberg.BaseMetastoreTableOperations.writeNewMetadata(BaseMetastoreTableOperations.java:137)
        at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:105)
        at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:118)
        at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:215)
        at org.apache.iceberg.BaseMetastoreCatalog.createTable(BaseMetastoreCatalog.java:48)
        at org.apache.iceberg.catalog.Catalog.createTable(Catalog.java:105)

The NPE is because `conf` is null in that method, but I verified that 
icebergCatalog.hadoopConf is the expected object.

Should it be expected that the GlueCatalog can be used with HadoopFileIO or is 
it only compatible with S3FileIO?

Greg


From: Jack Ye <yezhao...@gmail.com>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Wednesday, July 7, 2021 at 4:16 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: GlueCatalog example?

Yeah, this is actually a good point: the documentation is mostly about loading 
the catalog into different SQL engines and lacks Java API examples. The 
integration tests are a good place to see Java examples: 
https://github.com/apache/iceberg/blob/master/aws/src/integration/java/org/apache/iceberg/aws/glue/GlueTestBase.java

-Jack Ye

On Wed, Jul 7, 2021 at 1:27 PM Greg Hill <gnh...@paypal.com.invalid> wrote:
Is there a Java example for the proper way to get the GlueCatalog object? We 
are trying to convert from HadoopTables and need access to the lower-level APIs 
to create and update tables with partitions.

I’m looking for something similar to these examples for HadoopTables and 
HiveCatalog: 
https://iceberg.apache.org/java-api-quickstart/

From what I can gather looking at the code, this is what I came up with (our 
catalog name is `iceberg`), but it feels like there’s probably a better way 
that I’m not seeing:

      this.icebergCatalog = new GlueCatalog();
      Configuration conf = spark.sparkContext().hadoopConfiguration();
      Map<String, String> props = ImmutableMap.of(
        "type", conf.get("spark.sql.catalog.iceberg.type"),
        "warehouse", conf.get("spark.sql.catalog.iceberg.warehouse"),
        "lock-impl", conf.get("spark.sql.catalog.iceberg.lock-impl"),
        "lock.table", conf.get("spark.sql.catalog.iceberg.lock.table"),
        "io-impl", conf.get("spark.sql.catalog.iceberg.io-impl")
      );
      this.icebergCatalog.initialize("iceberg", props);

Sorry for the potentially n00b question, but I’m a n00b 😃

Greg
