stevenayers opened a new issue, #5455:
URL: https://github.com/apache/hudi/issues/5455
Hi All,
I'm currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0
(soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).
With Iceberg, you can query a table through the Glue catalog like this:
```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "path": "my_catalog.my_glue_database.my_iceberg_table",
        "connectionName": "Iceberg Connector for Glue 3.0",
    },
    transformation_ctx="IcebergDyF",
).toDF()
```
I'd like to do something similar with Hudi:
```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    },
    transformation_ctx="HudiDyF",
)
```
That would mean we wouldn't need to fetch the S3 path of our data via boto3 every time, like so:
```python
import boto3

client = boto3.client('glue')
response = client.get_table(
    DatabaseName='my_glue_database',
    Name='my_hudi_table'
)  # <-- don't want this
targetPath = response['Table']['StorageDescriptor']['Location']  # <-- or this

df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "path": targetPath,  # <-- or this
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    },
    transformation_ctx="HudiDyF",
)
# OR
sourceTableDF = spark.read.format('hudi').load(targetPath)
```
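In the meantime, one way I could tidy this up is to wrap the boto3 lookup and the option assembly in small helpers, so each job only passes the database and table name. This is just a sketch: the helper names (`resolve_table_path`, `hudi_read_options`) are my own, not a Hudi or Glue API, and it assumes the standard shape of the Glue `get_table` response shown above.

```python
def resolve_table_path(database, table, client=None):
    """Look up a table's S3 location in the AWS Glue catalog.

    `client` is injectable for testing; defaults to a real Glue client.
    """
    if client is None:
        import boto3  # imported lazily so the helper is easy to stub out
        client = boto3.client("glue")
    response = client.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Location"]


def hudi_read_options(database, table, partition_key, path):
    """Assemble the connection_options dict from the examples above."""
    return {
        "className": "org.apache.hudi",
        "path": path,
        "hoodie.table.name": table,
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    }
```

The job body then shrinks to `path = resolve_table_path(db, tbl)` followed by `create_dynamic_frame.from_options(connection_type="marketplace.spark", connection_options=hudi_read_options(db, tbl, pk, path), ...)` — but it's still the boto3 round trip I'd prefer to avoid entirely.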
Is there any way to do this? I'm very new to Hudi, so if my configuration settings are wrong and this is actually possible, please let me know!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]