Re: [PR] Added BlockCache to HFile reader. [hudi]

via GitHub Tue, 02 Jun 2026 17:52:50 -0700


hudi-agent commented on code in PR #13427:
URL: https://github.com/apache/hudi/pull/13427#discussion_r3345309970



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -101,34 +110,72 @@ public class HoodieAvroHFileReader extends 
HoodieAvroFileReaderBase implements H
 
   private final Object sharedLock = new Object();
 
-  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, 
CacheConfig cacheConfig) throws IOException {
-    this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf, 
cacheConfig, Option.empty());
+  public HoodieAvroHFileReader(Configuration hadoopConf, Path path) throws 
IOException {
+    this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf, 
Option.empty());
   }
 
-  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, 
CacheConfig cacheConfig, FileSystem fs, Option<Schema> schemaOpt) throws 
IOException {
-    this(path, fs, hadoopConf, cacheConfig, schemaOpt);
+  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem 
fs, Option<Schema> schemaOpt) throws IOException {
+    this(path, fs, hadoopConf, schemaOpt);
   }
 
-  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, 
CacheConfig cacheConfig, FileSystem fs, byte[] content, Option<Schema> 
schemaOpt) throws IOException {
-    this(path, fs, hadoopConf, cacheConfig, schemaOpt, Option.of(content));
+  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem 
fs, byte[] content, Option<Schema> schemaOpt) throws IOException {
+    this(path, fs, hadoopConf, schemaOpt, Option.of(content));
   }
 
-  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, CacheConfig config, Option<Schema> schemaOpt) throws IOException {
-    this(path, fs, hadoopConf, config, schemaOpt, Option.empty());
+  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, Option<Schema> schemaOpt) throws IOException {
+    this(path, fs, hadoopConf, schemaOpt, Option.empty());
   }
 
-  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, CacheConfig config, Option<Schema> schemaOpt, Option<byte[]> 
content) throws IOException {
+  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, Option<Schema> schemaOpt, Option<byte[]> content) throws 
IOException {
     this.path = path;
     this.fs = fs;
     this.hadoopConf = hadoopConf;
-    this.config = config;
     this.content = content;
 
     // Shared reader is instantiated lazily.
     this.sharedReader = Option.empty();
     this.sharedScanner = Option.empty();
     this.schema = schemaOpt.map(Lazy::eagerly)
         .orElseGet(() -> Lazy.lazily(() -> 
fetchSchema(getSharedHFileReader())));
+
+    synchronized (HoodieAvroHFileReader.class) {
+      // HBase 2.4+ does not allocate a block cache automatically within the 
CacheConfig but requires a BlockCache
+      // instance to be passed in. This is different from HBase 1.x where 
CacheConfig allocated and used a static global
+      // BlockCache instance.
+      // 
https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java#L688
+
+      // BlockCache speeds up lookup from the HFile. To emulate the behavior 
of HBase 1.x, we will allocate a global
+      // static global block cache here if it is enabled. The configs for the 
BlockCache are described in the link
+      // below and can be passed using the hadoop configuration.
+      // 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java
+      if (GLOBAL_HFILE_BLOCK_CACHE == null) {
+        GLOBAL_HFILE_BLOCK_CACHE = 
BlockCacheFactory.createBlockCache(hadoopConf);
+        if (GLOBAL_HFILE_BLOCK_CACHE != null) {
+          LOG.info("Allocated a new global block cache for hfile readers " + 
GLOBAL_HFILE_BLOCK_CACHE);
+        }
+      }
+
+      // The BlockCache keys are based on the name of the files being cached. 
Within HUDI, there is a non-zero chance that
+      // two different files can have the same name. Also, when multiple 
datasets are being accessed in the same JVM, there
+      // could be filename conflicts from different datasets for the MDT files 
paths as they are not UUIDs and hence are
+      // not unique across datasets (timestamp may be same).
+      // To avoid conflicts, we will only enable BlockCache if the name of the 
file is unique and has not been
+      // opened before.
+      final String cacheKey = path.getName();
+      boolean enableBlockCache = false;

Review Comment:
   🤖 The local `enableBlockCache` is declared but never read or assigned after 
initialization. Looks like leftover from a previous iteration of the logic — 
could you remove it (or wire it into the actual branch decision below if that 
was the intent)?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends 
HoodieAvroFileReaderBase implements H
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieAvroHFileReader.class);
 
+  // The global block cache instance used across all HFile readers
+  private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
+  // Map of filename to the filepath for opened files
+  private static Map<String, String> FILE_NAME_TO_PATH_MAP = new HashMap<>();
+
   private final Path path;

Review Comment:
   🤖 `FILE_NAME_TO_PATH_MAP` is a process-lifetime static map with no eviction 
— every unique HFile name ever opened (including per-log-block inline paths 
from `lookupRecords` now keyed by pathHash) stays here forever. For 
long-running drivers/executors that touch many MDT files, log blocks, and 
multi-table workloads, this can grow into the millions of entries. Have you 
considered bounding it (e.g., a Guava `Cache` with size cap) or scoping it to a 
session?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/fs/inline/InLineFSUtils.java:
##########
@@ -98,8 +105,9 @@ public static Path getOuterFilePathFromInlinePath(Path 
inlineFSPath) {
   public static long startOffset(Path inlineFSPath) {
     assertInlineFSPath(inlineFSPath);
 
-    String[] slices = inlineFSPath.toString().split("[?&=]");
-    return Long.parseLong(slices[slices.length - 3]);
+    final Matcher matcher = 
START_OFFSET_PATTERN.matcher(inlineFSPath.toString());
+    matcher.find();

Review Comment:
   🤖 `matcher.find()` return value isn't checked here (and the same below on 
line 123 for `length`). If the path doesn't contain `?start_offset=<digits>` 
for any reason, `matcher.group(1)` will throw `IllegalStateException("No match 
found")` with no path context — making diagnosis harder than the previous 
split-based failure. Could you check `find()` and throw a `HoodieException` 
with the offending path instead?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends 
HoodieAvroFileReaderBase implements H
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieAvroHFileReader.class);
 
+  // The global block cache instance used across all HFile readers
+  private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
+  // Map of filename to the filepath for opened files
+  private static Map<String, String> FILE_NAME_TO_PATH_MAP = new HashMap<>();

Review Comment:
   🤖 nit: could you add `final` here? `private static final Map<String, String> 
FILE_NAME_TO_PATH_MAP = new HashMap<>();` — without it, the field can be 
accidentally reassigned, which would silently drop all the cache-conflict 
tracking.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
@@ -197,9 +199,15 @@ protected <T> ClosableIterator<HoodieRecord<T>> 
lookupRecords(List<String> sorte
         blockContentLoc.getContentPositionInLogFile(),
         blockContentLoc.getBlockSize());
 
+    // HFile BlockCache key is based on the name of the file. The inlinefs 
path has the format
+    // 
inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length>
 whose filename
+    // component is parsed as "?start_offset=start_offset>&length=<length>" 
ignoring the actual outer file itself.
+    // To keep this unique (as cache keys should be unique for each file), we 
add a hash which is calculated from the
+    // entire filename. So the cache key would now be 
?start_offset=start_offset>&length=<length>&pathHash=<hash>
+    final String pathHash = 
StringUtils.toHexString(HashID.hash(inlinePath.toString(), 
HashID.Size.BITS_128));
+    Path cachableInlinePath = new Path(inlinePath.toString() + "&pathHash=" + 
pathHash);
     try (final HoodieAvroHFileReader reader =
-             new HoodieAvroHFileReader(inlineConf, inlinePath, new 
CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf),
-             Option.of(getSchemaFromHeader()))) {
+             new HoodieAvroHFileReader(inlineConf, cachableInlinePath, 
inlinePath.getFileSystem(inlineConf), Option.of(getSchemaFromHeader()))) {
       // Get writer's schema from the header

Review Comment:
   🤖 nit: small typo — `cachableInlinePath` should be `cacheableInlinePath` 
(the correct English spelling is "cacheable").
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -101,34 +110,72 @@ public class HoodieAvroHFileReader extends 
HoodieAvroFileReaderBase implements H
 
   private final Object sharedLock = new Object();
 
-  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, 
CacheConfig cacheConfig) throws IOException {
-    this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf, 
cacheConfig, Option.empty());
+  public HoodieAvroHFileReader(Configuration hadoopConf, Path path) throws 
IOException {
+    this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf, 
Option.empty());
   }
 
-  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, 
CacheConfig cacheConfig, FileSystem fs, Option<Schema> schemaOpt) throws 
IOException {
-    this(path, fs, hadoopConf, cacheConfig, schemaOpt);
+  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem 
fs, Option<Schema> schemaOpt) throws IOException {
+    this(path, fs, hadoopConf, schemaOpt);
   }
 
-  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, 
CacheConfig cacheConfig, FileSystem fs, byte[] content, Option<Schema> 
schemaOpt) throws IOException {
-    this(path, fs, hadoopConf, cacheConfig, schemaOpt, Option.of(content));
+  public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem 
fs, byte[] content, Option<Schema> schemaOpt) throws IOException {
+    this(path, fs, hadoopConf, schemaOpt, Option.of(content));
   }
 
-  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, CacheConfig config, Option<Schema> schemaOpt) throws IOException {
-    this(path, fs, hadoopConf, config, schemaOpt, Option.empty());
+  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, Option<Schema> schemaOpt) throws IOException {
+    this(path, fs, hadoopConf, schemaOpt, Option.empty());
   }
 
-  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, CacheConfig config, Option<Schema> schemaOpt, Option<byte[]> 
content) throws IOException {
+  public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration 
hadoopConf, Option<Schema> schemaOpt, Option<byte[]> content) throws 
IOException {
     this.path = path;
     this.fs = fs;
     this.hadoopConf = hadoopConf;
-    this.config = config;
     this.content = content;
 
     // Shared reader is instantiated lazily.
     this.sharedReader = Option.empty();
     this.sharedScanner = Option.empty();
     this.schema = schemaOpt.map(Lazy::eagerly)
         .orElseGet(() -> Lazy.lazily(() -> 
fetchSchema(getSharedHFileReader())));
+
+    synchronized (HoodieAvroHFileReader.class) {
+      // HBase 2.4+ does not allocate a block cache automatically within the 
CacheConfig but requires a BlockCache
+      // instance to be passed in. This is different from HBase 1.x where 
CacheConfig allocated and used a static global
+      // BlockCache instance.
+      // 
https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java#L688
+
+      // BlockCache speeds up lookup from the HFile. To emulate the behavior 
of HBase 1.x, we will allocate a global
+      // static global block cache here if it is enabled. The configs for the 
BlockCache are described in the link
+      // below and can be passed using the hadoop configuration.
+      // 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java
+      if (GLOBAL_HFILE_BLOCK_CACHE == null) {
+        GLOBAL_HFILE_BLOCK_CACHE = 
BlockCacheFactory.createBlockCache(hadoopConf);
+        if (GLOBAL_HFILE_BLOCK_CACHE != null) {
+          LOG.info("Allocated a new global block cache for hfile readers " + 
GLOBAL_HFILE_BLOCK_CACHE);
+        }
+      }
+
+      // The BlockCache keys are based on the name of the files being cached. 
Within HUDI, there is a non-zero chance that
+      // two different files can have the same name. Also, when multiple 
datasets are being accessed in the same JVM, there
+      // could be filename conflicts from different datasets for the MDT files 
paths as they are not UUIDs and hence are
+      // not unique across datasets (timestamp may be same).
+      // To avoid conflicts, we will only enable BlockCache if the name of the 
file is unique and has not been
+      // opened before.
+      final String cacheKey = path.getName();
+      boolean enableBlockCache = false;
+      if (!FILE_NAME_TO_PATH_MAP.containsKey(cacheKey)) {
+        // Add new cache key mapping
+        FILE_NAME_TO_PATH_MAP.put(cacheKey, path.toString());
+      }
+
+      if (FILE_NAME_TO_PATH_MAP.get(cacheKey).equals(path.toString())) {
+        this.config = new CacheConfig(hadoopConf, GLOBAL_HFILE_BLOCK_CACHE);

Review Comment:
   🤖 The first-wins behavior here makes caching non-deterministic: whichever 
file with name `X` was opened first in the JVM gets to use the BlockCache; any 
subsequent file with the same name but a different path silently has its cache 
disabled (with a WARN). For workloads that open multiple datasets in one JVM 
(e.g., metaserver, multi-table services), this means a file's caching behavior 
depends on JVM warm-up order rather than its own attributes. Is that 
intentional? Would it be more predictable to either always-disable on 
collision, or to use a path-derived cache key (like the pathHash trick used in 
`HoodieHFileDataBlock.lookupRecords`) for all readers?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Added BlockCache to HFile reader. [hudi]

Reply via email to