ChristinaTech commented on code in PR #6742:
URL: https://github.com/apache/iceberg/pull/6742#discussion_r1186773219
##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -111,7 +117,8 @@ public GlueCatalog() {}
@Override
public void initialize(String name, Map<String, String> properties) {
- this.catalogProperties = ImmutableMap.copyOf(properties);
+ this.catalogProperties = new HashMap<>();
+ catalogProperties.putAll(properties);
Review Comment:
Making CatalogProperties non-immutable in order to effectively side-load
information into the AWS client factory post-initialization is a dangerous
precedent to set. In addition, it doesn't seem to actually accomplish anything
besides allowing the GlueCatalog to suddenly switch regions
post-initialization, which is likely to introduce some dangerous side effects.
##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -437,6 +444,81 @@ public void renameTable(TableIdentifier from,
TableIdentifier to) {
LOG.info("Successfully renamed table from {} to {}", from, to);
}
+ @Override
+ public org.apache.iceberg.Table registerTable(
+ TableIdentifier identifier, String metadataFileLocation) {
+ Preconditions.checkArgument(
+ isValidIdentifier(identifier), "Table identifier to register is
invalid: " + identifier);
+ Preconditions.checkArgument(
+ metadataFileLocation != null && !metadataFileLocation.isEmpty(),
+ "Cannot register an empty metadata file location as a table");
+
+ // keep the original behavior when force-register-table flag is off
+ if (!awsProperties.glueCatalogForceRegisterTable()) {
+ return super.registerTable(identifier, metadataFileLocation);
+ }
+
+ String factoryImpl =
+ PropertyUtil.propertyAsString(catalogProperties,
AwsProperties.CLIENT_FACTORY, null);
+ if (factoryImpl != null &&
factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {
Review Comment:
It hurts extensibility to make logic specific to a particular implementation
class. If, for example, a customer needs to extend `AssumeRoleAwsClientFactory`
to add some functionality for their own use, this logic will break as the class
name will no longer match.
##########
aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java:
##########
@@ -187,6 +187,18 @@ public class AwsProperties implements Serializable {
*/
public static final String GLUE_CATALOG_ENDPOINT = "glue.endpoint";
+ /**
+ * If set, Glue will always update the catalog table if the table already
exists in glue catalog.
+ * By default, Glue catalog will only be able to create new table and will
throw
+ * AlreadyExistsException when register an existing table name.
+ */
+ public static final String GLUE_CATALOG_FORCE_REGISTER_TABLE =
"glue.force-register-table";
+
+ public static final boolean GLUE_CATALOG_FORCE_REGISTER_TABLE_DEFAULT =
false;
+
+ /** Configure the Glue Catalog S3 FileIO Region to allow cross region s3
access */
+ public static final String GLUE_CATALOG_FILE_IO_REGION =
"glue.catalog-file-io-region";
Review Comment:
So we already have `client.region` which sets the region for all services. I
am going to be introducing a change in the near future that will allow setting
per-region for the default AWS Client Factory (and we can extend it to the
Assume Role client factory as well), so we probably want a more general
per-service naming scheme. Based on how existing parameters are formatted where
`glue.` and `s3.` are already established prefixes, most likely something like:
`glue.region`, `s3.region`, `kms.region`, etc...
##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -437,6 +444,81 @@ public void renameTable(TableIdentifier from,
TableIdentifier to) {
LOG.info("Successfully renamed table from {} to {}", from, to);
}
+ @Override
+ public org.apache.iceberg.Table registerTable(
+ TableIdentifier identifier, String metadataFileLocation) {
+ Preconditions.checkArgument(
+ isValidIdentifier(identifier), "Table identifier to register is
invalid: " + identifier);
+ Preconditions.checkArgument(
+ metadataFileLocation != null && !metadataFileLocation.isEmpty(),
+ "Cannot register an empty metadata file location as a table");
+
+ // keep the original behavior when force-register-table flag is off
+ if (!awsProperties.glueCatalogForceRegisterTable()) {
+ return super.registerTable(identifier, metadataFileLocation);
+ }
+
+ String factoryImpl =
+ PropertyUtil.propertyAsString(catalogProperties,
AwsProperties.CLIENT_FACTORY, null);
+ if (factoryImpl != null &&
factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {
+ // overwrite client assume_role_region for file IO to make cross region
call
+ String catalogFileIORegion = awsProperties.getGlueCatalogFileIORegion();
+ if (catalogFileIORegion != null) {
+ catalogProperties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION,
catalogFileIORegion);
+ }
+ }
+
+ TableOperations ops = newTableOps(identifier);
+ InputFile metadataFile = ops.io().newInputFile(metadataFileLocation);
+ TableMetadata metadata = TableMetadataParser.read(ops.io(), metadataFile);
+
+ Map<String, String> tableParameters =
+ ImmutableMap.of(
+ BaseMetastoreTableOperations.TABLE_TYPE_PROP,
+
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toLowerCase(Locale.ENGLISH),
+ BaseMetastoreTableOperations.METADATA_LOCATION_PROP,
+ metadataFileLocation);
+
+ String databaseName =
+ IcebergToGlueConverter.getDatabaseName(
+ identifier, awsProperties.glueCatalogSkipNameValidation());
+ String tableName =
+ IcebergToGlueConverter.getTableName(
+ identifier, awsProperties.glueCatalogSkipNameValidation());
+
+ TableInput tableInput =
+ TableInput.builder()
+ .applyMutation(
+ builder ->
+ IcebergToGlueConverter.setTableInputInformation(
+ builder, metadata, tableParameters))
+ .name(tableName)
+ .tableType(GlueTableOperations.GLUE_EXTERNAL_TABLE_TYPE)
+ .parameters(tableParameters)
+ .build();
+
+ try {
+ glue.createTable(
+
CreateTableRequest.builder().databaseName(databaseName).tableInput(tableInput).build());
+ } catch (software.amazon.awssdk.services.glue.model.AlreadyExistsException
e) {
+ GetTableResponse response =
+ glue.getTable(
+
GetTableRequest.builder().databaseName(databaseName).name(tableName).build());
+ String versionId = response.table().versionId();
+ glue.updateTable(
+ UpdateTableRequest.builder()
+ .databaseName(databaseName)
+ .tableInput(tableInput)
+ .versionId(versionId)
+ .build());
+ } catch (EntityNotFoundException e) {
Review Comment:
So this logic is a combination of a fork of the logic in
`BaseMetastoreCatalog.registerTable` and
`GlueTableOperations.persistGlueTable`. As `GlueTableOperations` also has
access to `AwsProperties`, I would recommend refactoring the logic in
GlueTableOperations so that `persistGlueTable` can conditionally fall back to
its update mode as that improves code reuse.
If the concern is the creation of an extra metadata file, it looks like
`GlueTableOperations` already checks whether it needs to during the
`writeNewMetadataIfRequired` function and considering tableMetadata is always
set for `registerTable`, it will never choose to write a new metadata file
anyways.
##########
aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java:
##########
@@ -437,6 +444,81 @@ public void renameTable(TableIdentifier from,
TableIdentifier to) {
LOG.info("Successfully renamed table from {} to {}", from, to);
}
+ @Override
+ public org.apache.iceberg.Table registerTable(
+ TableIdentifier identifier, String metadataFileLocation) {
+ Preconditions.checkArgument(
+ isValidIdentifier(identifier), "Table identifier to register is
invalid: " + identifier);
+ Preconditions.checkArgument(
+ metadataFileLocation != null && !metadataFileLocation.isEmpty(),
+ "Cannot register an empty metadata file location as a table");
+
+ // keep the original behavior when force-register-table flag is off
+ if (!awsProperties.glueCatalogForceRegisterTable()) {
+ return super.registerTable(identifier, metadataFileLocation);
+ }
+
+ String factoryImpl =
+ PropertyUtil.propertyAsString(catalogProperties,
AwsProperties.CLIENT_FACTORY, null);
+ if (factoryImpl != null &&
factoryImpl.equals(AssumeRoleAwsClientFactory.class.getName())) {
+ // overwrite client assume_role_region for file IO to make cross region
call
+ String catalogFileIORegion = awsProperties.getGlueCatalogFileIORegion();
+ if (catalogFileIORegion != null) {
+ catalogProperties.put(AwsProperties.CLIENT_ASSUME_ROLE_REGION,
catalogFileIORegion);
+ }
Review Comment:
Can't this logic and the associated parameter be removed and replaced with
just setting `AwsProperties.CLIENT_ASSUME_ROLE_REGION` directly? This change is
not in any way scoped to this call or AWS service, it effectively creates a
case where what region the GlueCatalog is calling can suddenly change
post-initialization after the first time `registerTable` is called.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]