vamshikrishnakyatham opened a new pull request, #13745: URL: https://github.com/apache/hudi/pull/13745
## PR Description

### Change Logs

This PR introduces a new Spark SQL procedure, `show_file_group_history`, that provides comprehensive tracking and debugging capabilities for Hudi file groups across their entire lifecycle. The procedure traces file group evolution through both the active and archived timelines, covering all instant states (COMPLETED, INFLIGHT, REQUESTED).

**Key Features:**

- **Complete Timeline Coverage**: Processes both active and archived timelines to provide a full historical view
- **All Instant States**: Shows COMPLETED operations with full metadata, plus INFLIGHT/REQUESTED operations with identifiable placeholders
- **Smart Filtering**: Supports optional filtering to a specific partition, plus limit-aware processing for performance
- **Debugging-Focused Output**: Includes operation types, file statistics, deletion tracking, and state information
- **Performance Optimized**: Implements limit-aware timeline processing to handle large datasets efficiently

**Implementation Details:**

- Added `ShowFileGroupHistoryProcedure.scala` with the file group tracking logic
- Added `TestShowFileGroupHistoryProcedure.scala` with extensive test coverage
- Registered the procedure in `HoodieProcedures.scala`
- Supports Hive-style partition filtering (e.g., `category=electronics`)
- Handles clean and rollback operations with proper deletion tracking

### Impact

**Public API Changes:**

- **New Procedure**: `show_file_group_history(table, file_group_id, [partition], [limit])`
- **Output Schema**: 20-column result set including instant details, file statistics, and operation metadata
- **User-Facing Feature**: Provides a debugging tool for Hudi table maintenance and troubleshooting

**Performance Impact:**

- **Positive**: Limit-aware processing avoids loading entire timelines unnecessarily
- **Optimized**: Partition filtering reduces data scanning
- **Scalable**: Handles large archived timelines efficiently through batched processing

**Use Cases:**

- Debug file group lifecycle issues
- Track file evolution across commits
- Identify the impact of cleaning and rollback operations
- Monitor file group health and statistics
- Troubleshoot partition-specific problems

### Risk level: **Low**

**Verification Done:**

- Comprehensive test suite covering basic functionality, partition filtering, complex updates, cleaning operations, archived timeline access, and error handling
- All scalastyle checks passed
- No modifications to existing procedures or core Hudi functionality
- Read-only operation with no data modification capabilities
- Extensive validation of edge cases (non-existent file groups, empty timelines, etc.)

### Documentation Update

**Required Updates:**

- Hudi website documentation needs to be updated to include the new `show_file_group_history` procedure
- **Jira Ticket**: [To be created] - Add documentation for the `show_file_group_history` procedure
- Procedure documentation includes comprehensive Scaladocs with usage examples, parameter descriptions, and output schema details

**Documentation Content:**

- Parameter descriptions and usage examples
- Output schema with all 20 columns explained
- Filter syntax and partition specification format
- Performance considerations and best practices
- Integration with existing Hudi debugging workflows

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed (scalastyle checks completed successfully)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
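For readers of this thread, a hedged usage sketch of the procedure described in the PR, using Spark SQL's named-argument `CALL` syntax for Hudi procedures. The table name `hudi_orders` and the `file_group_id` value below are illustrative placeholders, not values from this PR; the parameter names follow the signature stated above (`table`, `file_group_id`, `partition`, `limit`):

```sql
-- Trace a file group's full history across the active and archived timelines.
-- 'hudi_orders' and the file group id are hypothetical examples.
CALL show_file_group_history(
  table => 'hudi_orders',
  file_group_id => 'f3a1c2d4-0000-4e6b-9abc-1234567890ab-0'
);

-- Restrict the scan to one Hive-style partition and cap the number of
-- instants processed, per the optional [partition] and [limit] parameters.
CALL show_file_group_history(
  table => 'hudi_orders',
  file_group_id => 'f3a1c2d4-0000-4e6b-9abc-1234567890ab-0',
  partition => 'category=electronics',
  limit => 50
);
```

Per the description, each call returns the 20-column result set (instant details, file statistics, operation metadata) for the matching file group, including placeholder rows for INFLIGHT/REQUESTED instants.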
