xloya commented on code in PR #5521:
URL: https://github.com/apache/gravitino/pull/5521#discussion_r1836455391
##########
catalogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java:
##########
@@ -581,31 +583,71 @@ public Schema alterSchema(NameIdentifier ident, SchemaChange... changes)
@Override
public boolean dropSchema(NameIdentifier ident, boolean cascade) throws NonEmptySchemaException {
try {
+ Namespace filesetNs =
+ NamespaceUtil.ofFileset(
+ ident.namespace().level(0), // metalake name
+ ident.namespace().level(1), // catalog name
+ ident.name() // schema name
+ );
+
+ List<FilesetEntity> filesets =
+     store.list(filesetNs, FilesetEntity.class, Entity.EntityType.FILESET);
+ if (!filesets.isEmpty() && !cascade) {
+ throw new NonEmptySchemaException("Schema %s is not empty", ident);
+ }
+
+ // Delete all the managed filesets, no matter whether the storage location is under the
+ // schema path or not.
+ // The reason we delete each managed fileset's storage location one by one is that we
+ // might otherwise mistakenly delete the storage location of an external fileset that
+ // happens to be under the schema path.
+ filesets.stream()
+ .filter(f -> f.filesetType() == Fileset.Type.MANAGED)
+ .forEach(
+ f -> {
+ try {
+ Path filesetPath = new Path(f.storageLocation());
Review Comment:
I see, that makes sense. Could we then parallelize the deletion to improve performance, for example with `parallelStream` or a worker thread pool?
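For illustration, a minimal sketch of the thread-pool variant of the suggestion (the class name, the `deleteLocation` helper, and the sample paths are hypothetical; the real code would call the Hadoop `FileSystem.delete` API for each managed fileset's storage location):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelDropSketch {

  // Hypothetical stand-in for deleting one fileset's storage location.
  // Real code would do something like: fs.delete(new Path(location), true)
  static boolean deleteLocation(String location) {
    return true;
  }

  public static void main(String[] args) {
    List<String> managedLocations =
        List.of("/warehouse/schema/f1", "/warehouse/schema/f2", "/warehouse/schema/f3");

    // A bounded pool keeps a schema with many filesets from spawning unbounded
    // threads, unlike parallelStream(), which shares the common ForkJoinPool.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<Boolean>> futures =
          managedLocations.stream()
              .map(loc -> CompletableFuture.supplyAsync(() -> deleteLocation(loc), pool))
              .toList();

      // join() waits for each deletion and rethrows any failure as a
      // CompletionException, so a partial failure is not silently ignored.
      boolean allDeleted = futures.stream().allMatch(CompletableFuture::join);
      System.out.println("allDeleted=" + allDeleted);
    } finally {
      pool.shutdown();
    }
  }
}
```

One design note on the choice: a dedicated `ExecutorService` lets the caller cap concurrency against the underlying storage, whereas `parallelStream` borrows the JVM-wide common pool.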
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]