nada-attia opened a new issue, #18134: URL: https://github.com/apache/hudi/issues/18134

### Describe the problem you faced

Timeline archival is a critical operation in Hudi that manages the growth of metadata files. However, when archival operations fail or encounter issues (especially OOM errors in production), there is limited visibility into:

- What caused the failure (OOM vs. other exceptions)
- How many commits were being processed when the failure occurred
- Whether archival completed successfully or failed silently

This lack of observability makes it difficult to:

- Debug archival failures in production
- Set appropriate heap sizes and configurations
- Monitor archival health proactively
- Correlate archival issues with table performance

### To Reproduce

Steps to reproduce the behavior:

1. Run archival on a table with a large number of commits
2. Monitor for failures or OOM errors
3. Attempt to determine the root cause from existing metrics/logs

**Expected behavior**

Archival operations should emit detailed metrics that allow operators to:

- Track successful vs. failed archival runs
- Identify OOM failures specifically
- Monitor the volume of commits being archived
- Detect patterns in archival failures

### Environment Description

* Hudi version: all versions
* Deployment: any (Spark, Flink, etc.)

### Solution Proposal

Add comprehensive metrics emission during timeline archival:

- OOM failure detection and tracking (`archivalOutOfMemory`)
- Exception-specific failure tracking (`archivalFailure.<ExceptionClassName>`)
- Commit count metrics (`archivalNumAllCommits`, `archivalNumWriteCommits`)
- Operation status (`archivalStatus`)

These metrics should be:

- Emitted through the existing `HoodieMetrics` framework
- Collected in both `TimelineArchiverV1` and `TimelineArchiverV2`
- Available for external monitoring systems (Prometheus, CloudWatch, etc.)

This will enable better production monitoring and faster issue resolution.
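
To make the proposal concrete, here is a minimal, self-contained sketch of how the five metrics could be recorded around a single archival run. It is not Hudi code: the class `ArchivalMetricsSketch`, the method `archiveWithMetrics`, the plain `Map` standing in for the `HoodieMetrics` framework, and the 1/0 encoding of `archivalStatus` are all assumptions for illustration. Only the metric names come from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; a plain Map stands in for the real metrics framework. */
final class ArchivalMetricsSketch {
  // Metric names as listed in the proposal.
  static final String OOM = "archivalOutOfMemory";
  static final String FAILURE_PREFIX = "archivalFailure.";
  static final String NUM_ALL_COMMITS = "archivalNumAllCommits";
  static final String NUM_WRITE_COMMITS = "archivalNumWriteCommits";
  static final String STATUS = "archivalStatus";

  // Assumed encoding of archivalStatus; the proposal does not define one.
  static final long STATUS_SUCCESS = 1L;
  static final long STATUS_FAILURE = 0L;

  private ArchivalMetricsSketch() {}

  /**
   * Runs the archival action and records the proposed metrics.
   * Commit counts are recorded before the attempt so they are still
   * available if the run fails part-way through.
   */
  static void archiveWithMetrics(Map<String, Long> metrics,
                                 long numAllCommits,
                                 long numWriteCommits,
                                 Runnable archival) {
    metrics.put(NUM_ALL_COMMITS, numAllCommits);
    metrics.put(NUM_WRITE_COMMITS, numWriteCommits);
    try {
      archival.run();
      metrics.put(STATUS, STATUS_SUCCESS);
    } catch (OutOfMemoryError oom) {
      // Caveat: allocating while handling an OOM can itself fail; a real
      // implementation would likely want pre-registered counters.
      metrics.merge(OOM, 1L, Long::sum);
      metrics.put(STATUS, STATUS_FAILURE);
      throw oom; // never swallow the error, only record it
    } catch (RuntimeException e) {
      // One counter per exception class: archivalFailure.<ExceptionClassName>
      metrics.merge(FAILURE_PREFIX + e.getClass().getSimpleName(), 1L, Long::sum);
      metrics.put(STATUS, STATUS_FAILURE);
      throw e;
    }
  }
}
```

In this sketch the failure is always rethrown after it is counted, so the metrics add visibility without changing how callers see the error. A real change would register these values through `HoodieMetrics` in both `TimelineArchiverV1` and `TimelineArchiverV2`, as the proposal states.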
