vamshikrishnakyatham commented on code in PR #13793:
URL: https://github.com/apache/hudi/pull/13793#discussion_r2316721561
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCleansProcedure.scala:
##########
@@ -17,53 +17,80 @@
package org.apache.spark.sql.hudi.command.procedures
-import org.apache.hudi.{HoodieCLIUtils, SparkAdapterSupport}
-import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieTimeline, TimelineLayout}
+import org.apache.hudi.SparkAdapterSupport
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieTimeline}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
-import java.util
-import java.util.Collections
import java.util.function.Supplier
import scala.collection.JavaConverters._
/**
- * Spark SQL procedure to show completed clean operations for a Hudi table.
+ * Spark SQL procedure to show all clean operations for a Hudi table.
*
- * This procedure displays information about clean operations that have been executed.
- * Clean operations remove old file versions to reclaim storage space and maintain table performance.
+ * This procedure provides a comprehensive view of Hudi clean operations.
+ * It displays completed clean operations with full partition metadata for both completed and pending operations.
*
* == Parameters ==
* - `table`: Required. The name of the Hudi table to query
- * - `limit`: Optional. Maximum number of clean operations to return (default: 10)
+ * - `path`: Optional. The path of the Hudi table (anyone of `table` or `path` must be provided)
+ * - `limit`: Optional. Maximum number of clean operations to return (default: 10, ignored if time range specified)
* - `showArchived`: Optional. Whether to include archived clean operations (default: false)
* - `filter`: Optional. SQL expression to filter results (default: empty string)
+ * - `startTime`: Optional. Start time for clean operations (format: yyyyMMddHHmmss, default: empty)
+ * - `endTime`: Optional. End time for clean operations (format: yyyyMMddHHmmss, default: empty)
*
* == Output Schema ==
* - `clean_time`: Timestamp when the clean operation was performed
* - `state_transition_time`: Time when the clean transitioned to completed state
+ * - `state`: Operation state (COMPLETED, INFLIGHT, REQUESTED)
* - `action`: The action type (always 'clean')
* - `start_clean_time`: When the clean operation started
+ * - `partition_path`: Partition path for the clean operation
+ * - `policy`: Clean policy used (KEEP_LATEST_COMMITS, etc.)
+ * - `delete_path_patterns`: Number of delete path patterns
+ * - `success_delete_files`: Number of successfully deleted files
+ * - `failed_delete_files`: Number of files that failed to delete
+ * - `is_partition_deleted`: Whether the entire partition was deleted
* - `time_taken_in_millis`: Duration of the clean operation in milliseconds
* - `total_files_deleted`: Total number of files deleted during the clean
* - `earliest_commit_to_retain`: The earliest commit that was retained
* - `last_completed_commit_timestamp`: The last completed commit at clean time
* - `version`: Version of the clean operation metadata
- * - Additional partition-level metadata columns when using `show_cleans_metadata`
+ * - `total_partitions_to_clean`: Total partitions to clean (for pending operations)
+ * - `total_partitions_to_delete`: Total partitions to delete (for pending operations)
*
- * == Error Handling ==
- * - Throws `IllegalArgumentException` for invalid filter expressions
- * - Throws `HoodieException` for table access issues
- * - Returns empty result set if no clean plans match the criteria
+ * == Data Availability by Operation State ==
+ * - **COMPLETED operations**: All execution and partition metadata fields are populated
+ * - **PENDING operations**: Plan fields are populated, execution fields are null (graceful handling)
*
* == Filter Support ==
* The `filter` parameter supports SQL expressions for filtering results.
*
* === Common Filter Examples ===
* {{{
+ * -- Show only completed operations (equivalent to old show_cleans)
+ * CALL show_cleans(
+ * table => 'my_table',
+ * filter => "state = 'COMPLETED'"
+ * )
+ *
+ * -- Show only pending operations (equivalent to old show_clean_plans)
+ * CALL show_cleans(
+ * table => 'my_table',
+ * filter => "state IN ('REQUESTED', 'INFLIGHT')"
Review Comment:
These are SQL-like filters; we support the `lower` and `upper` functions for
handling case sensitivity.
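
To illustrate the point about `upper`, here is a minimal sketch that builds a `show_cleans` call whose filter compares the upper-cased `state` column against upper-cased literals. The object and both helper functions are hypothetical and not part of Hudi; the sketch only constructs the statement text and assumes the filter accepts `upper(state)` as described in this comment.

```scala
import java.util.Locale

object ShowCleansFilterSketch {
  // Hypothetical helper (not part of Hudi): builds a filter that compares the
  // upper-cased `state` column against upper-cased literals, so callers can
  // pass state names in any letter case.
  def stateFilter(states: Seq[String]): String = {
    val literals = states.map(s => s"'${s.toUpperCase(Locale.ROOT)}'").mkString(", ")
    s"upper(state) IN ($literals)"
  }

  // Hypothetical helper: renders the CALL statement text for show_cleans.
  def showCleansCall(table: String, filter: String): String =
    s"""CALL show_cleans(table => '$table', filter => "$filter")"""

  def main(args: Array[String]): Unit = {
    // Mixed-case input is normalized before it reaches the filter.
    val call = showCleansCall("my_table", stateFilter(Seq("requested", "Inflight")))
    println(call)
    // prints: CALL show_cleans(table => 'my_table', filter => "upper(state) IN ('REQUESTED', 'INFLIGHT')")
  }
}
```

The resulting string could then be passed to `spark.sql(...)` in a session where the Hudi procedures are registered; that step is left out so the sketch stays self-contained.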