hudi-agent commented on code in PR #13427:
URL: https://github.com/apache/hudi/pull/13427#discussion_r3345309970
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -101,34 +110,72 @@ public class HoodieAvroHFileReader extends
HoodieAvroFileReaderBase implements H
private final Object sharedLock = new Object();
- public HoodieAvroHFileReader(Configuration hadoopConf, Path path,
CacheConfig cacheConfig) throws IOException {
- this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf,
cacheConfig, Option.empty());
+ public HoodieAvroHFileReader(Configuration hadoopConf, Path path) throws
IOException {
+ this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf,
Option.empty());
}
- public HoodieAvroHFileReader(Configuration hadoopConf, Path path,
CacheConfig cacheConfig, FileSystem fs, Option<Schema> schemaOpt) throws
IOException {
- this(path, fs, hadoopConf, cacheConfig, schemaOpt);
+ public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem
fs, Option<Schema> schemaOpt) throws IOException {
+ this(path, fs, hadoopConf, schemaOpt);
}
- public HoodieAvroHFileReader(Configuration hadoopConf, Path path,
CacheConfig cacheConfig, FileSystem fs, byte[] content, Option<Schema>
schemaOpt) throws IOException {
- this(path, fs, hadoopConf, cacheConfig, schemaOpt, Option.of(content));
+ public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem
fs, byte[] content, Option<Schema> schemaOpt) throws IOException {
+ this(path, fs, hadoopConf, schemaOpt, Option.of(content));
}
- public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, CacheConfig config, Option<Schema> schemaOpt) throws IOException {
- this(path, fs, hadoopConf, config, schemaOpt, Option.empty());
+ public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, Option<Schema> schemaOpt) throws IOException {
+ this(path, fs, hadoopConf, schemaOpt, Option.empty());
}
- public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, CacheConfig config, Option<Schema> schemaOpt, Option<byte[]>
content) throws IOException {
+ public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, Option<Schema> schemaOpt, Option<byte[]> content) throws
IOException {
this.path = path;
this.fs = fs;
this.hadoopConf = hadoopConf;
- this.config = config;
this.content = content;
// Shared reader is instantiated lazily.
this.sharedReader = Option.empty();
this.sharedScanner = Option.empty();
this.schema = schemaOpt.map(Lazy::eagerly)
.orElseGet(() -> Lazy.lazily(() ->
fetchSchema(getSharedHFileReader())));
+
+ synchronized (HoodieAvroHFileReader.class) {
+ // HBase 2.4+ does not allocate a block cache automatically within the
CacheConfig but requires a BlockCache
+ // instance to be passed in. This is different from HBase 1.x where
CacheConfig allocated and used a static global
+ // BlockCache instance.
+ //
https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java#L688
+
+ // BlockCache speeds up lookup from the HFile. To emulate the behavior
of HBase 1.x, we will allocate a global
+ // static global block cache here if it is enabled. The configs for the
BlockCache are described in the link
+ // below and can be passed using the hadoop configuration.
+ //
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java
+ if (GLOBAL_HFILE_BLOCK_CACHE == null) {
+ GLOBAL_HFILE_BLOCK_CACHE =
BlockCacheFactory.createBlockCache(hadoopConf);
+ if (GLOBAL_HFILE_BLOCK_CACHE != null) {
+ LOG.info("Allocated a new global block cache for hfile readers " +
GLOBAL_HFILE_BLOCK_CACHE);
+ }
+ }
+
+ // The BlockCache keys are based on the name of the files being cached.
Within HUDI, there is a non-zero chance that
+ // two different files can have the same name. Also, when multiple
datasets are being accessed in the same JVM, there
+ // could be filename conflicts from different datasets for the MDT files
paths as they are not UUIDs and hence are
+ // not unique across datasets (timestamp may be same).
+ // To avoid conflicts, we will only enable BlockCache if the name of the
file is unique and has not been
+ // opened before.
+ final String cacheKey = path.getName();
+ boolean enableBlockCache = false;
Review Comment:
🤖 The local `enableBlockCache` is declared but never read or assigned after
initialization. Looks like leftover from a previous iteration of the logic —
could you remove it (or wire it into the actual branch decision below if that
was the intent)?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends
HoodieAvroFileReaderBase implements H
private static final Logger LOG =
LoggerFactory.getLogger(HoodieAvroHFileReader.class);
+ // The global block cache instance used across all HFile readers
+ private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
+ // Map of filename to the filepath for opened files
+ private static Map<String, String> FILE_NAME_TO_PATH_MAP = new HashMap<>();
+
private final Path path;
Review Comment:
🤖 `FILE_NAME_TO_PATH_MAP` is a process-lifetime static map with no eviction
— every unique HFile name ever opened (including per-log-block inline paths
from `lookupRecords` now keyed by pathHash) stays here forever. For
long-running drivers/executors that touch many MDT files, log blocks, and
multi-table workloads, this can grow into the millions of entries. Have you
considered bounding it (e.g., a Guava `Cache` with size cap) or scoping it to a
session?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/fs/inline/InLineFSUtils.java:
##########
@@ -98,8 +105,9 @@ public static Path getOuterFilePathFromInlinePath(Path
inlineFSPath) {
public static long startOffset(Path inlineFSPath) {
assertInlineFSPath(inlineFSPath);
- String[] slices = inlineFSPath.toString().split("[?&=]");
- return Long.parseLong(slices[slices.length - 3]);
+ final Matcher matcher =
START_OFFSET_PATTERN.matcher(inlineFSPath.toString());
+ matcher.find();
Review Comment:
🤖 `matcher.find()` return value isn't checked here (and the same below on
line 123 for `length`). If the path doesn't contain `?start_offset=<digits>`
for any reason, `matcher.group(1)` will throw `IllegalStateException("No match
found")` with no path context — making diagnosis harder than the previous
split-based failure. Could you check `find()` and throw a `HoodieException`
with the offending path instead?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends
HoodieAvroFileReaderBase implements H
private static final Logger LOG =
LoggerFactory.getLogger(HoodieAvroHFileReader.class);
+ // The global block cache instance used across all HFile readers
+ private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
+ // Map of filename to the filepath for opened files
+ private static Map<String, String> FILE_NAME_TO_PATH_MAP = new HashMap<>();
Review Comment:
🤖 nit: could you add `final` here? `private static final Map<String, String>
FILE_NAME_TO_PATH_MAP = new HashMap<>();` — without it, the field can be
accidentally reassigned, which would silently drop all the cache-conflict
tracking.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java:
##########
@@ -197,9 +199,15 @@ protected <T> ClosableIterator<HoodieRecord<T>>
lookupRecords(List<String> sorte
blockContentLoc.getContentPositionInLogFile(),
blockContentLoc.getBlockSize());
+ // HFile BlockCache key is based on the name of the file. The inlinefs
path has the format
+ //
inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length>
whose filename
+ // component is parsed as "?start_offset=start_offset>&length=<length>"
ignoring the actual outer file itself.
+ // To keep this unique (as cache keys should be unique for each file), we
add a hash which is calculated from the
+ // entire filename. So the cache key would now be
?start_offset=start_offset>&length=<length>&pathHash=<hash>
+ final String pathHash =
StringUtils.toHexString(HashID.hash(inlinePath.toString(),
HashID.Size.BITS_128));
+ Path cachableInlinePath = new Path(inlinePath.toString() + "&pathHash=" +
pathHash);
try (final HoodieAvroHFileReader reader =
- new HoodieAvroHFileReader(inlineConf, inlinePath, new
CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf),
- Option.of(getSchemaFromHeader()))) {
+ new HoodieAvroHFileReader(inlineConf, cachableInlinePath,
inlinePath.getFileSystem(inlineConf), Option.of(getSchemaFromHeader()))) {
// Get writer's schema from the header
Review Comment:
🤖 nit: small typo — `cachableInlinePath` should be `cacheableInlinePath`
(the correct English spelling is "cacheable").
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java:
##########
@@ -101,34 +110,72 @@ public class HoodieAvroHFileReader extends
HoodieAvroFileReaderBase implements H
private final Object sharedLock = new Object();
- public HoodieAvroHFileReader(Configuration hadoopConf, Path path,
CacheConfig cacheConfig) throws IOException {
- this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf,
cacheConfig, Option.empty());
+ public HoodieAvroHFileReader(Configuration hadoopConf, Path path) throws
IOException {
+ this(path, FSUtils.getFs(path.toString(), hadoopConf), hadoopConf,
Option.empty());
}
- public HoodieAvroHFileReader(Configuration hadoopConf, Path path,
CacheConfig cacheConfig, FileSystem fs, Option<Schema> schemaOpt) throws
IOException {
- this(path, fs, hadoopConf, cacheConfig, schemaOpt);
+ public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem
fs, Option<Schema> schemaOpt) throws IOException {
+ this(path, fs, hadoopConf, schemaOpt);
}
- public HoodieAvroHFileReader(Configuration hadoopConf, Path path,
CacheConfig cacheConfig, FileSystem fs, byte[] content, Option<Schema>
schemaOpt) throws IOException {
- this(path, fs, hadoopConf, cacheConfig, schemaOpt, Option.of(content));
+ public HoodieAvroHFileReader(Configuration hadoopConf, Path path, FileSystem
fs, byte[] content, Option<Schema> schemaOpt) throws IOException {
+ this(path, fs, hadoopConf, schemaOpt, Option.of(content));
}
- public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, CacheConfig config, Option<Schema> schemaOpt) throws IOException {
- this(path, fs, hadoopConf, config, schemaOpt, Option.empty());
+ public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, Option<Schema> schemaOpt) throws IOException {
+ this(path, fs, hadoopConf, schemaOpt, Option.empty());
}
- public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, CacheConfig config, Option<Schema> schemaOpt, Option<byte[]>
content) throws IOException {
+ public HoodieAvroHFileReader(Path path, FileSystem fs, Configuration
hadoopConf, Option<Schema> schemaOpt, Option<byte[]> content) throws
IOException {
this.path = path;
this.fs = fs;
this.hadoopConf = hadoopConf;
- this.config = config;
this.content = content;
// Shared reader is instantiated lazily.
this.sharedReader = Option.empty();
this.sharedScanner = Option.empty();
this.schema = schemaOpt.map(Lazy::eagerly)
.orElseGet(() -> Lazy.lazily(() ->
fetchSchema(getSharedHFileReader())));
+
+ synchronized (HoodieAvroHFileReader.class) {
+ // HBase 2.4+ does not allocate a block cache automatically within the
CacheConfig but requires a BlockCache
+ // instance to be passed in. This is different from HBase 1.x where
CacheConfig allocated and used a static global
+ // BlockCache instance.
+ //
https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CacheConfig.java#L688
+
+ // BlockCache speeds up lookup from the HFile. To emulate the behavior
of HBase 1.x, we will allocate a global
+ // static global block cache here if it is enabled. The configs for the
BlockCache are described in the link
+ // below and can be passed using the hadoop configuration.
+ //
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java
+ if (GLOBAL_HFILE_BLOCK_CACHE == null) {
+ GLOBAL_HFILE_BLOCK_CACHE =
BlockCacheFactory.createBlockCache(hadoopConf);
+ if (GLOBAL_HFILE_BLOCK_CACHE != null) {
+ LOG.info("Allocated a new global block cache for hfile readers " +
GLOBAL_HFILE_BLOCK_CACHE);
+ }
+ }
+
+ // The BlockCache keys are based on the name of the files being cached.
Within HUDI, there is a non-zero chance that
+ // two different files can have the same name. Also, when multiple
datasets are being accessed in the same JVM, there
+ // could be filename conflicts from different datasets for the MDT files
paths as they are not UUIDs and hence are
+ // not unique across datasets (timestamp may be same).
+ // To avoid conflicts, we will only enable BlockCache if the name of the
file is unique and has not been
+ // opened before.
+ final String cacheKey = path.getName();
+ boolean enableBlockCache = false;
+ if (!FILE_NAME_TO_PATH_MAP.containsKey(cacheKey)) {
+ // Add new cache key mapping
+ FILE_NAME_TO_PATH_MAP.put(cacheKey, path.toString());
+ }
+
+ if (FILE_NAME_TO_PATH_MAP.get(cacheKey).equals(path.toString())) {
+ this.config = new CacheConfig(hadoopConf, GLOBAL_HFILE_BLOCK_CACHE);
Review Comment:
🤖 The first-wins behavior here makes caching non-deterministic: whichever
file with name `X` was opened first in the JVM gets to use the BlockCache; any
subsequent file with the same name but a different path silently has its cache
disabled (with a WARN). For workloads that open multiple datasets in one JVM
(e.g., metaserver, multi-table services), this means a file's caching behavior
depends on JVM warm-up order rather than its own attributes. Is that
intentional? Would it be more predictable to either always-disable on
collision, or to use a path-derived cache key (like the pathHash trick used in
`HoodieHFileDataBlock.lookupRecords`) for all readers?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]