adarshsaraogi commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-3873362040
Hi @yihua
**Environment**
- Engine: Flink on EMR (YARN)
- Hudi versions seen with this issue:
  - `hudi-flink1.20-bundle-0.15.0-amzn-5.jar`
  - `hudi-flink-bundle-1.0.2-amzn-1.jar` (same behavior)
- Metadata Table (MDT): enabled and required

**Stack trace (redacted paths)**
During MDT compaction / log file reads, we intermittently see this failure:
```
2026-01-05 10:31:15,778 ERROR org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got exception when reading log file
org.apache.hudi.exception.HoodieIOException: Failed to initialize HFile reader for s3://<REDACTED-BUCKET>/hudi/bronze/<REDACTED-DOMAIN>/<REDACTED-TABLE>/.hoodie/metadata/files/.files-0000-0_20260105080204267001.log.1_0-1-1
    at org.apache.hudi.io.hadoop.HoodieHFileUtils.createHFileReader(HoodieHFileUtils.java:121)
    at org.apache.hudi.io.hadoop.HoodieHBaseAvroHFileReader.getHFileReader(HoodieHBaseAvroHFileReader.java:276)
    ...
Caused by: org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file s3://<REDACTED-BUCKET>/hudi/bronze/<REDACTED-DOMAIN>/<REDACTED-TABLE>/.hoodie/metadata/files/.files-0000-0_20260105080204267001.log.1_0-1-1
    at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.HFileInfo.initTrailerAndContext(HFileInfo.java:349)
    ...
Caused by: java.lang.ExceptionInInitializerError
    at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:404)
    ...
Caused by: java.lang.RuntimeException: Could not create interface org.apache.hudi.org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the hadoop compatibility jar on the classpath?
    at org.apache.hudi.org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:74)
    ...
Caused by: java.util.NoSuchElementException
    at java.util.ServiceLoader$2.next(ServiceLoader.java:1318)
    ...
```
In the same time window we also see:
```
WARN org.apache.hadoop.metrics2.util.MBeans - Error creating MBean object name: Hadoop:service=s3a-file-system,name=MetricsSystem,sub=Stats
org.apache.hadoop.metrics2.MetricsException: ... already exists!
ERROR org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Got exception when reading log file
Caused by: org.apache.hadoop.fs.PathIOException: s3://<REDACTED-ARCHIVE-BUCKET>: FileSystem is closed!
    at org.apache.hadoop.fs.s3a.S3AFileSystem.checkNotClosed(S3AFileSystem.java:4925)
```
The job eventually dies with Flink/YARN shutting down the cluster entrypoint.
**AWS internal analysis (Summary)**
From AWS’s internal support investigation on our EMR cluster:
- The `CorruptHFileException` is misleading: the files themselves are not corrupt.
- The actual root cause is a failure during HBase metrics initialization:
  - `ServiceLoader` cannot create `MetricsRegionServerSourceFactory`
  - This is reported as a classpath / “hadoop compatibility jar” issue
- This failure cascades into:
  - metrics system conflicts (MBeans ... already exists)
  - premature S3A `FileSystem is closed!` errors
  - Flink job failure during MDT compaction
- Suggested mitigations were:
  - disabling MDT, or
  - explicitly adjusting the classpath to include HBase compatibility jars

MDT is critical for us, so disabling it permanently is not an option.
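
Because the top-level `CorruptHFileException` hides the real failure, it can help to pull the innermost `Caused by:` line out of a trace when triaging. The following is only a minimal sketch: the function name `innermost_cause` is hypothetical, and it assumes the trace is fed in one at a time with each `Caused by:` starting at column 0, as in unwrapped Java logs.

```shell
# Hypothetical helper: read one Java stack trace on stdin and print its
# innermost "Caused by:" line, which is the last one in the chain.
innermost_cause() {
  grep '^Caused by:' | tail -n 1
}

# Example usage (trace.log is an assumed file holding a single trace):
# innermost_cause < trace.log
```

For the first trace above, this would surface the `NoSuchElementException` thrown from `ServiceLoader` rather than the HFile trailer error.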
**What we verified ourselves**
**1. ServiceLoader configuration and implementation are present**
On the running EMR cluster, for the jar Flink uses:
```
ls -l /usr/lib/hudi/hudi-flink-bundle.jar
# -> hudi-flink1.20-bundle-0.15.0-amzn-5.jar
# (symlinked into /usr/lib/flink/lib)
jar -tf /usr/lib/hudi/hudi-flink-bundle.jar | grep MetricsRegionServerSourceFactory
```
This confirms the presence of:
- `META-INF/services/org.apache.hudi.org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory`
- `org/apache/hudi/org/apache/hadoop/hbase/regionserver/MetricsRegionServerSourceFactory.class`
- `org/apache/hudi/org/apache/hadoop/hbase/regionserver/MetricsRegionServerSourceFactoryImpl.class`

So this does not appear to be a simple “missing META-INF/services file” problem.
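
A further check, which we have not run yet, is whether the class name written *inside* the service file matches a class that is actually in the jar (for example, whether shading relocated the entry). The sketch below assumes `unzip` is available on the node; the function name `missing_providers` and the `/tmp/listing.txt` path are hypothetical.

```shell
# Hypothetical helper.
#   $1    : file holding `jar -tf` output for the bundle
#   stdin : contents of a META-INF/services/<interface> file
# Prints every provider class name that has no matching .class entry.
missing_providers() {
  listing=$1
  # Strip comments, whitespace and blank lines (all allowed in service files).
  sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' |
  while IFS= read -r cls; do
    entry="$(printf '%s' "$cls" | tr . /).class"
    grep -qxF -- "$entry" "$listing" || printf '%s\n' "$cls"
  done
}

# Example usage (paths follow the listing above):
# jar=/usr/lib/hudi/hudi-flink-bundle.jar
# svc=META-INF/services/org.apache.hudi.org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory
# jar -tf "$jar" > /tmp/listing.txt
# unzip -p "$jar" "$svc" | missing_providers /tmp/listing.txt
```

Empty output would mean every listed provider resolves to a class in the same jar; any printed name would point at a packaging mismatch rather than a runtime race.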
**2. Only one Hudi Flink bundle is used at runtime**
```
/usr/lib/hudi/hudi-flink-bundle.jar
/usr/lib/flink/lib/hudi-flink-bundle.jar
```
Both paths are symlinks to the same physical jar
(`hudi-flink1.20-bundle-0.15.0-amzn-5.jar`).
We are not mixing multiple Hudi bundles.
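
For anyone wanting to repeat this check, a minimal sketch follows. It assumes GNU `readlink -f` is available; the function name `distinct_targets` is hypothetical.

```shell
# Hypothetical helper: print the distinct physical files behind the given
# paths. Symlinks that resolve to the same jar collapse into a single line.
distinct_targets() {
  for p in "$@"; do
    readlink -f -- "$p"
  done | sort -u
}

# Example usage (paths taken from the listing above):
# distinct_targets /usr/lib/hudi/hudi-flink-bundle.jar \
#                  /usr/lib/flink/lib/hudi-flink-bundle.jar
```

A single output line means both paths point at one jar; more than one line would indicate separate bundles on the classpath.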
**3. The issue is tightly coupled to MDT**
- With MDT enabled, we see this failure during MDT compaction / log reading.
- With MDT disabled (hoodie.metadata.enable=false), the job runs
successfully (but loses MDT benefits).
- In some cases, the issue self-resolves after a restart with no
configuration change, suggesting fragile metrics/MBean initialization rather
than persistent data or packaging corruption.
**Relation to HUDI-7170 / PR #10241**
Our understanding is that HUDI-7170 / PR #10241 introduced a Hudi-native
HFile reader to reduce or eliminate HBase dependencies.
However, on EMR using the `*-amzn-*` Flink bundles:
- MDT in Flink still appears to initialize HBase-based HFile reader and
metrics
- This HBase metrics initialization path seems fragile in Flink environments
- The failure manifests as MetricsRegionServerSourceFactory / ServiceLoader
errors even though the provider and service files exist
From our perspective:
- The issue is not corrupt data
- The issue is not missing service configuration
- It appears to be an HBase-metrics-in-Flink initialization problem on the
MDT HFile read path
**Discussion / Guidance Requested from the Hudi Community**
We would appreciate guidance from the Hudi community on the following
points, especially for Flink + MDT deployments:
- What is the intended execution path for MDT HFile reads in Flink in Hudi
0.15.x and 1.x?
  - Is MDT expected to still initialize HBase metrics
  (`MetricsRegionServerSourceFactory`)?
  - Or should the native HFile reader fully bypass HBase (including
  metrics) in Flink runtimes?
- Are there recommended configurations or flags to:
  - force MDT to use the native HFile reader path, or
  - disable / avoid HBase metrics initialization when running under Flink?
- For users running on EMR, what is the recommended artifact to use:
  - the shaded `hudi-flink*-bundle-*-amzn-*.jar`, or
  - an upstream / lighter Flink bundle that avoids HBase metrics dependencies
  for MDT?
- Is the observed behavior, where failures sometimes disappear after a
restart, consistent with any known issues around:
  - Hadoop / HBase metrics system initialization,
  - MBean registration conflicts,
  - or ServiceLoader / classloader ordering in Flink environments?
Any guidance, confirmation of expected behavior, or pointers to ongoing /
planned fixes would be extremely helpful.
**Conclusion**
This issue does not appear to be caused by:
- missing META-INF/services entries
- corrupt HFiles
- or multiple conflicting Hudi Flink bundles on the classpath
Instead, it appears to be a fragility in HBase metrics initialization when
MDT HFile reads occur inside Flink, even on Hudi versions that include the
native HFile reader work.
An additional operational challenge is that the issue sometimes resolves on
its own:
- After a Flink job or YARN application restart, ingestion resumes
automatically without any configuration change.
- However, on EMR, cluster and job recovery can take hours, causing
prolonged ingestion downtime even though no data repair is required.
Because MDT is critical for our workloads, and because recovery time is
significant, we are looking for a stable and supported way to run Flink + MDT
without relying on fragile HBase metrics initialization paths.
Any clarification on the intended design, recommended jars, or best
practices for this setup would be greatly appreciated.