Copilot commented on code in PR #9570:
URL: https://github.com/apache/gravitino/pull/9570#discussion_r2650291725
##########
clients/client-python/gravitino/filesystem/gvfs_base_operations.py:
##########
@@ -489,6 +493,30 @@ def _get_actual_filesystem(
fileset_ident, location_name
)
+ def _merge_fileset_properties(
+ self,
+ catalog: FilesetCatalog,
+ schema: Schema,
+ fileset: Fileset,
+ actual_location: str,
+ ) -> Dict[str, str]:
+ """Merge properties from catalog, schema, fileset, options, and user-defined configs.
+ :param catalog: The fileset catalog
+ :param schema: The schema
+ :param fileset: The fileset
+ :param actual_location: The actual storage location
+ :return: Merged properties dictionary
+ """
+ fileset_props = dict(catalog.properties())
+ fileset_props.update(schema.properties())
+ fileset_props.update(fileset.properties())
+ # Get user-defined configurations for the actual location
+ user_defined_configs = self._get_user_defined_configs(actual_location)
+ if self._options:
+ fileset_props.update(self._options)
+ fileset_props.update(user_defined_configs)
+ return fileset_props
Review Comment:
The property merging order may not be correct. Currently, the order is:
catalog -> schema -> fileset -> options -> user-defined configs. This means
user-defined configs (line 517) will override options (line 516). However, the
comment in the S3StorageHandler (line 291) states "S3 endpoint from client
options has a higher priority", suggesting options should have higher priority
than catalog properties. The current implementation contradicts this
expectation since user-defined configs override everything, even though they
are derived from options with path-specific matching.
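
If client options are indeed meant to win, a minimal sketch of the reordered merge could look like the following. This is a standalone illustration with plain dicts standing in for the catalog/schema/fileset objects and hypothetical parameter names, not the project's actual method:

```python
from typing import Dict, Optional


def merge_fileset_properties(
    catalog_props: Dict[str, str],
    schema_props: Dict[str, str],
    fileset_props: Dict[str, str],
    user_defined_configs: Dict[str, str],
    options: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Later updates win: client options are applied last so they take
    precedence over path-matched user-defined configs."""
    merged: Dict[str, str] = {}
    merged.update(catalog_props)
    merged.update(schema_props)
    merged.update(fileset_props)
    merged.update(user_defined_configs)
    if options:
        merged.update(options)
    return merged


# Precedence check: the same key set at every level resolves to the
# client-option value.
result = merge_fileset_properties(
    {"s3-endpoint": "from-catalog"},
    {"s3-endpoint": "from-schema"},
    {"s3-endpoint": "from-fileset"},
    {"s3-endpoint": "from-user-config"},
    {"s3-endpoint": "from-options"},
)
print(result["s3-endpoint"])  # from-options
```

Whether user-defined configs should sit below or above plain options is a design decision for the PR author; the sketch only demonstrates that moving `options` last makes them the highest-priority source.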
##########
catalogs/hadoop-common/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemUtils.java:
##########
@@ -51,28 +44,14 @@ public class FileSystemUtils {
private FileSystemUtils() {}
- public static Map<String, FileSystemProvider> getFileSystemProviders(String fileSystemProviders) {
+ public static Map<String, FileSystemProvider> getFileSystemProviders() {
Map<String, FileSystemProvider> resultMap = Maps.newHashMap();
ServiceLoader<FileSystemProvider> allFileSystemProviders =
ServiceLoader.load(FileSystemProvider.class);
- Set<String> providersInUses =
- fileSystemProviders != null
- ? Arrays.stream(fileSystemProviders.split(","))
- .map(f -> f.trim().toLowerCase(Locale.ROOT))
- .collect(Collectors.toSet())
- : Sets.newHashSet();
-
- // Add built-in file system providers to the use list automatically.
- providersInUses.add(BUILTIN_LOCAL_FS_PROVIDER.toLowerCase(Locale.ROOT));
- providersInUses.add(BUILTIN_HDFS_FS_PROVIDER.toLowerCase(Locale.ROOT));
-
// Only get the file system providers that are in the user list and check if the scheme is
Review Comment:
The comment on lines 52-53 is now inaccurate. It states "Only get the file
system providers that are in the user list and check if the scheme is unique",
but there is no longer any user list filtering since the parameter was removed.
The comment should be updated to reflect that all available FileSystemProviders
from the ServiceLoader are being loaded, and only uniqueness of schemes is
being checked.
```suggestion
// Load all available file system providers from the ServiceLoader and ensure each scheme is
```
##########
clients/client-python/tests/integration/integration_test_env.py:
##########
@@ -180,7 +182,10 @@ def restart_server(cls):
if result.stderr:
logger.info("stderr: %s", result.stderr)
- if not check_gravitino_server_status():
+ succes = check_gravitino_server_status()
+ if succes:
Review Comment:
Typo: 'succes' should be 'success'.
```suggestion
success = check_gravitino_server_status()
if success:
```
##########
catalogs/catalog-fileset/src/main/java/org/apache/gravitino/catalog/fileset/FilesetCatalogOperations.java:
##########
@@ -245,14 +245,9 @@ public void initialize(
}
if (!disableFSOps) {
- String fileSystemProviders =
- (String)
- propertiesMetadata
- .catalogPropertiesMetadata()
.getOrDefault(config, FilesetCatalogPropertiesMetadata.FILESYSTEM_PROVIDERS);
this.fileSystemProvidersMap =
ImmutableMap.<String, FileSystemProvider>builder()
- .putAll(FileSystemUtils.getFileSystemProviders(fileSystemProviders))
+ .putAll(FileSystemUtils.getFileSystemProviders())
.build();
Review Comment:
The `FILESYSTEM_PROVIDERS` catalog property is no longer used in the code,
but it is still defined in `FilesetCatalogPropertiesMetadata` and is still
referenced in integration tests. This is a breaking change: users who
configure this property expecting it to filter filesystem providers will now
have it silently ignored. Either remove the property entirely with a proper
deprecation path, or restore the filtering logic.
##########
clients/filesystem-hadoop3/src/main/java/org/apache/gravitino/filesystem/hadoop/GravitinoVirtualFileSystemConfiguration.java:
##########
@@ -184,7 +184,10 @@ public class GravitinoVirtualFileSystemConfiguration {
/** The default value for whether to enable auto-creation of fileset location. */
public static final boolean FS_GRAVITINO_AUTO_CREATE_LOCATION_DEFAULT = true;
- /** The prefix for user-defined location configs: {@code fs.path.config.<locationName>=<path>}. */
+ /**
+ * The prefix for user-defined location configs: {@code
+ * fs.path.config.<locationName><property_name>=<property_value>}.
Review Comment:
The comment format is inconsistent with the actual configuration pattern.
The comment suggests the pattern is
"fs.path.config.<locationName><property_name>" (missing a dot separator), but
based on the Python implementation and tests, the actual pattern should be
"fs.path.config.<locationName>.<property_name>" (with a dot separator). The
comment should include the dot between locationName and property_name for
clarity.
```suggestion
* fs.path.config.<locationName>.<property_name>=<property_value>}.
```
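
For reference, path-scoped keys of this shape are typically parsed by stripping a per-location prefix. The sketch below illustrates the dot-separated pattern; the prefix constant and helper name are assumptions for illustration, not the project's actual API:

```python
from typing import Dict

# Assumed constant name; the real identifier in the codebase may differ.
FS_GRAVITINO_PATH_CONFIG_PREFIX = "fs.path.config."


def extract_location_configs(
    conf: Dict[str, str], location_name: str
) -> Dict[str, str]:
    """Collect <property_name>=<property_value> pairs configured under
    fs.path.config.<locationName>.<property_name> (note the dot separator
    between location name and property name)."""
    prefix = f"{FS_GRAVITINO_PATH_CONFIG_PREFIX}{location_name}."
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }


conf = {
    "fs.path.config.loc1.s3-endpoint": "http://127.0.0.1:9000",
    "fs.path.config.loc2.s3-endpoint": "http://127.0.0.1:9001",
    "fs.gravitino.server.uri": "http://127.0.0.1:8090",
}
print(extract_location_configs(conf, "loc1"))
# {'s3-endpoint': 'http://127.0.0.1:9000'}
```

Without the dot in the documented pattern, a reader could reasonably concatenate location name and property name and produce keys the parser would never match.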
##########
clients/client-python/gravitino/filesystem/gvfs_base_operations.py:
##########
@@ -754,3 +785,116 @@ def _get_fileset(self, fileset_ident: NameIdentifier):
return fileset
finally:
write_lock.release()
+
+ def _get_fileset_schema(self, schema_ident: NameIdentifier):
+ """Get the schema by the schema identifier from the cache or load it from the server if the cache is disabled.
+ :param schema_ident: The schema identifier
+ :return: The schema
+ """
+ if not self._enable_fileset_metadata_cache:
+ catalog_ident: NameIdentifier = NameIdentifier.of(
+ schema_ident.namespace().level(0), schema_ident.namespace().level(1)
+ )
+ catalog: FilesetCatalog = self._get_fileset_catalog(catalog_ident)
+ return catalog.as_schemas().load_schema(schema_ident.name())
+
+ read_lock = self._schema_cache_lock.gen_rlock()
+ try:
+ read_lock.acquire()
+ cache_value: Schema = self._schema_cache.get(schema_ident)
+ if cache_value is not None:
+ return cache_value
+ finally:
+ read_lock.release()
+
+ write_lock = self._schema_cache_lock.gen_wlock()
+ try:
+ write_lock.acquire()
+ cache_value: Schema = self._schema_cache.get(schema_ident)
+ if cache_value is not None:
+ return cache_value
+
+ catalog_ident: NameIdentifier = NameIdentifier.of(
+ schema_ident.namespace().level(0), schema_ident.namespace().level(1)
+ )
+ catalog: FilesetCatalog = self._get_fileset_catalog(catalog_ident)
+ schema = catalog.as_schemas().load_schema(schema_ident.name())
+ self._schema_cache[schema_ident] = schema
+ return schema
+ finally:
+ write_lock.release()
Review Comment:
The new methods `_merge_fileset_properties` and `_get_fileset_schema` are
not covered by unit tests. While integration tests may exercise these code
paths, unit tests would provide better coverage and faster feedback for these
critical methods that determine property precedence and caching behavior.
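
A precedence test along the lines the reviewer suggests could look like the sketch below. Because this snippet cannot import the project, `_merge_fileset_properties` is reproduced as a local stand-in with the same merge order as the diff; in a real test the method would be invoked on the operations object with mocked catalog, schema, and fileset:

```python
import unittest
from typing import Dict, Optional
from unittest import mock


def merge_like_diff(
    catalog,
    schema,
    fileset,
    user_defined_configs: Dict[str, str],
    options: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Local stand-in mirroring the merge order shown in the diff:
    catalog -> schema -> fileset -> options -> user-defined configs."""
    props = dict(catalog.properties())
    props.update(schema.properties())
    props.update(fileset.properties())
    if options:
        props.update(options)
    props.update(user_defined_configs)
    return props


class TestMergeFilesetProperties(unittest.TestCase):
    def _mock_with_props(self, props):
        m = mock.Mock()
        m.properties.return_value = props
        return m

    def test_user_defined_configs_currently_win(self):
        merged = merge_like_diff(
            self._mock_with_props({"k": "catalog"}),
            self._mock_with_props({"k": "schema"}),
            self._mock_with_props({"k": "fileset"}),
            {"k": "user-config"},
            {"k": "options"},
        )
        # Documents the behavior flagged in the review: user-defined
        # configs override client options under the current order.
        self.assertEqual(merged["k"], "user-config")

    def test_lower_levels_survive_when_not_overridden(self):
        merged = merge_like_diff(
            self._mock_with_props({"a": "catalog"}),
            self._mock_with_props({"b": "schema"}),
            self._mock_with_props({}),
            {},
            None,
        )
        self.assertEqual(merged, {"a": "catalog", "b": "schema"})
```

Run with `python -m unittest`. Pinning the precedence in a unit test like this would make any future reordering of the merge an explicit, reviewed change rather than a silent one.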
--
This is an automated message from the Apache Git Service.