vamshikrishnakyatham opened a new pull request, #13745:
URL: https://github.com/apache/hudi/pull/13745

   ## PR Description
   
   ### Change Logs
   
   This PR adds a new Spark SQL procedure, `show_file_group_history`, for 
tracking and debugging a Hudi file group across its entire lifecycle. The 
procedure traces the file group's evolution through both the active and 
archived timelines and reports instants in every state (COMPLETED, INFLIGHT, 
REQUESTED).
   
   **Key Features:**
   - **Complete Timeline Coverage**: Processes both active and archived 
timelines to provide full historical view
   - **All Instant States**: Shows COMPLETED operations with full metadata, 
plus INFLIGHT/REQUESTED operations with identifiable placeholders
   - **Smart Filtering**: Supports an optional partition filter and 
limit-aware processing for performance
   - **Debugging-Focused Output**: Includes operation types, file statistics, 
deletion tracking, and state information
   - **Performance Optimized**: Implements limit-aware timeline processing to 
handle large datasets efficiently
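
The limit-aware processing described above can be illustrated with a minimal sketch. It is not the code in this PR: `Instant`, `LimitAwareHistory`, and the newest-first ordering are hypothetical stand-ins, and the real procedure may batch or order instants differently. The point is only that a lazy iterator lets the archived timeline go unread once the active timeline has already produced `limit` matches.

```scala
// Hypothetical stand-in for a timeline instant; this is NOT Hudi's HoodieInstant.
final case class Instant(time: String, state: String, fileGroupIds: Set[String])

object LimitAwareHistory {
  // Assumption: both sequences are ordered newest-first.
  def history(active: Seq[Instant],
              archived: Seq[Instant],
              fileGroupId: String,
              limit: Int): List[Instant] = {
    // Iterator.++ takes its argument by name, so the archived iterator is
    // only created if the active timeline is exhausted before `limit` is hit.
    val all = active.iterator ++ archived.iterator
    all
      .filter(_.fileGroupIds.contains(fileGroupId)) // keep instants touching the file group
      .take(limit)                                  // stop pulling once enough are found
      .toList
  }
}
```

With a small limit, recent history is answered from the active timeline alone; the archived timeline is consulted only when more entries are requested than the active timeline can supply.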
   
   **Implementation Details:**
   - Added `ShowFileGroupHistoryProcedure.scala` with comprehensive file group 
tracking logic
   - Added `TestShowFileGroupHistoryProcedure.scala` with extensive test 
coverage
   - Integrated procedure registration in `HoodieProcedures.scala`
   - Supports Hive-style partition filtering (e.g., `category=electronics`)
   - Handles clean and rollback operations with proper deletion tracking
   
   ### Impact
   
   **Public API Changes:**
   - **New Procedure**: `show_file_group_history(table, file_group_id, 
[partition], [limit])`
   - **Output Schema**: 20-column result set including instant details, file 
statistics, and operation metadata
   - **User-Facing Feature**: Provides powerful debugging tool for Hudi table 
maintenance and troubleshooting
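
A rough usage sketch for the signature above, assuming the procedure accepts named arguments with exactly the parameter names listed (`table`, `file_group_id`, `partition`, `limit`), in the same `name => value` style used by other Hudi procedures. `ShowFileGroupHistoryCall` is a hypothetical helper that only builds the statement text; the table name and file group id in the usage note are placeholders.

```scala
object ShowFileGroupHistoryCall {
  // Wraps a value in single quotes; values containing quotes are rejected
  // rather than escaped, to keep this sketch simple.
  private def quote(s: String): String = {
    require(!s.contains("'"), s"value must not contain a single quote: $s")
    "'" + s + "'"
  }

  // Builds the CALL statement; optional arguments are omitted when not set.
  def build(table: String,
            fileGroupId: String,
            partition: Option[String] = None,
            limit: Option[Int] = None): String = {
    val args = List(
      Some(s"table => ${quote(table)}"),
      Some(s"file_group_id => ${quote(fileGroupId)}"),
      partition.map(p => s"partition => ${quote(p)}"),
      limit.map(n => s"limit => $n")
    ).flatten
    s"CALL show_file_group_history(${args.mkString(", ")})"
  }
}
```

In a Spark session with the Hudi SQL extensions enabled, the resulting string could be passed to `spark.sql(...)`, for example `spark.sql(ShowFileGroupHistoryCall.build("hudi_table", "fg-1", Some("category=electronics"), Some(10))).show(false)`.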
   
   **Performance Impact:**
   - **Positive**: Limit-aware processing prevents loading entire timelines 
unnecessarily
   - **Optimized**: Smart partition filtering reduces data scanning
   - **Scalable**: Handles large archived timelines efficiently through batched 
processing
   
   **Use Cases:**
   - Debug file group lifecycle issues
   - Track file evolution across commits
   - Identify cleaning and rollback impacts
   - Monitor file group health and statistics
   - Troubleshoot partition-specific problems
   
   ### Risk level: **Low**
   
   **Verification Done:**
   - Comprehensive test suite covering basic functionality, partition 
filtering, complex updates, cleaning operations, archived timeline access, and 
error handling
   - All scalastyle checks passed
   - No modifications to existing procedures or core Hudi functionality
   - Read-only operation with no data modification capabilities
   - Extensive validation of edge cases (non-existent file groups, empty 
timelines, etc.)
   
   ### Documentation Update
   
   **Required Updates:**
   - Hudi website documentation needs to be updated to include the new 
`show_file_group_history` procedure
   - **Jira Ticket**: [To be created] - Add documentation for 
show_file_group_history procedure
   - Procedure documentation includes comprehensive Scaladocs with usage 
examples, parameter descriptions, and output schema details
   
   **Documentation Content:**
   - Parameter descriptions and usage examples
   - Output schema with all 20 columns explained
   - Filter syntax and partition specification format
   - Performance considerations and best practices
   - Integration with existing Hudi debugging workflows
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed (scalastyle checks completed successfully)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.