stevenayers opened a new issue, #5455:
URL: https://github.com/apache/hudi/issues/5455
Hi All,
I'm currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0
(soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).
With Iceberg, you can query a table through the Glue catalog like this:
```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "path": "my_catalog.my_glue_database.my_iceberg_table",
        "connectionName": "Iceberg Connector for Glue 3.0",
    },
    transformation_ctx="IcebergDyF",
).toDF()
```
I'd like to do something similar with Hudi:
```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    },
    transformation_ctx="HudiDyF",
)
```
That would mean we wouldn't need to fetch the S3 path of our data via boto3 every time, like so:
```python
import boto3

client = boto3.client('glue')
response = client.get_table(
    DatabaseName='my_glue_database',
    Name='my_hudi_table'
)  # <-- don't want this
targetPath = response['Table']['StorageDescriptor']['Location']  # <-- or this

df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "path": targetPath,  # <-- or this
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    },
    transformation_ctx="HudiDyF",
)
# OR
sourceTableDF = spark.read.format('hudi').load(targetPath)
```
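In the meantime, one way I could tidy this up is to wrap the boto3 lookup and the option assembly in small helpers, so each job only passes the database and table name. This is just a sketch: the helper names (`resolve_table_path`, `hudi_read_options`) are my own, not a Hudi or Glue API, and it assumes the standard shape of the Glue `get_table` response shown above.

```python
def resolve_table_path(database, table, client=None):
    """Look up a table's S3 location in the AWS Glue catalog.

    `client` is injectable for testing; defaults to a real Glue client.
    """
    if client is None:
        import boto3  # imported lazily so the helper is easy to stub out
        client = boto3.client("glue")
    response = client.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Location"]


def hudi_read_options(database, table, partition_key, path):
    """Assemble the connection_options dict from the examples above."""
    return {
        "className": "org.apache.hudi",
        "path": path,
        "hoodie.table.name": table,
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    }
```

The job body then shrinks to `path = resolve_table_path(db, tbl)` followed by `create_dynamic_frame.from_options(connection_type="marketplace.spark", connection_options=hudi_read_options(db, tbl, pk, path), ...)` — but it's still the boto3 round trip I'd prefer to avoid entirely.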
Is there any way to do this? I'm very new to Hudi, so if my configuration settings are wrong and this is actually possible, please let me know!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]