the-other-tim-brown commented on code in PR #10344:
URL: https://github.com/apache/hudi/pull/10344#discussion_r1432733293
##########
hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java:
##########
@@ -78,41 +78,49 @@ public class ExternalSpillableMap<T extends Serializable, R
extends Serializable
// Enables compression of values stored in disc
private final boolean isCompressionEnabled;
// current space occupied by this map in-memory
- private Long currentInMemoryMapSize;
+ private long currentInMemoryMapSize;
// An estimate of the size of each payload written to this map
private volatile long estimatedPayloadSize = 0;
// Base File Path
private final String baseFilePath;
- public ExternalSpillableMap(Long maxInMemorySizeInBytes, String
baseFilePath, SizeEstimator<T> keySizeEstimator,
+ public ExternalSpillableMap(long maxInMemorySizeInBytes, String
baseFilePath, SizeEstimator<T> keySizeEstimator,
SizeEstimator<R> valueSizeEstimator) throws
IOException {
this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator,
valueSizeEstimator, DiskMapType.BITCASK);
}
- public ExternalSpillableMap(Long maxInMemorySizeInBytes, String
baseFilePath, SizeEstimator<T> keySizeEstimator,
+ public ExternalSpillableMap(long maxInMemorySizeInBytes, String
baseFilePath, SizeEstimator<T> keySizeEstimator,
SizeEstimator<R> valueSizeEstimator, DiskMapType
diskMapType) throws IOException {
this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator,
valueSizeEstimator, diskMapType, false);
}
- public ExternalSpillableMap(Long maxInMemorySizeInBytes, String
baseFilePath, SizeEstimator<T> keySizeEstimator,
+ public ExternalSpillableMap(long maxInMemorySizeInBytes, String
baseFilePath, SizeEstimator<T> keySizeEstimator,
SizeEstimator<R> valueSizeEstimator, DiskMapType
diskMapType, boolean isCompressionEnabled) throws IOException {
this.inMemoryMap = new HashMap<>();
this.baseFilePath = baseFilePath;
- this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes *
sizingFactorForInMemoryMap);
+ this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes *
SIZING_FACTOR_FOR_IN_MEMORY_MAP);
this.currentInMemoryMapSize = 0L;
this.keySizeEstimator = keySizeEstimator;
this.valueSizeEstimator = valueSizeEstimator;
this.diskMapType = diskMapType;
this.isCompressionEnabled = isCompressionEnabled;
}
+ private DiskMap<T, R> getDiskBasedMap() {
+ return getDiskBasedMap(false);
+ }
+
+ private DiskMap<T, R> getOrCreateDiskBasedMap() {
+ return getDiskBasedMap(true);
+ }
+
private DiskMap<T, R> getDiskBasedMap(boolean forceInitialization) {
if (null == diskBasedMap) {
- if (!forceInitialization) {
- return DiskMap.empty();
- }
synchronized (this) {
if (null == diskBasedMap) {
+ if (!forceInitialization) {
+ return DiskMap.empty();
Review Comment:
Ok this conversation is going in circles. Can you and @nsivabalan align on
the semantics you want? Previously I made all the calls explicit about whether
they wanted to force initialization of the map and then I was asked to change
that. I think it makes more sense to make it explicit personally since these
are all internal methods so there doesn't need to be another layer of
abstraction here that can cause confusion.
@danny0405 I think you are missing some of the details here. The outside
caller does not need to worry since all the implementation details are hidden
from them as they should be. The calls in `put` will only instantiate the map
when required. You don't need to instantiate a map to check if a key exists for
example. If the map is not instantiated previously, the key will not be present
in it.
The cost of instantiating a map on every use is much more expensive. You are
doing IO and allocating more objects. If you use RocksDB as your map type,
you're going to create column family and all sorts of resources
([source](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/collection/RocksDBDAO.java#L91)).
If you use BitCaskDiskMap you are creating file handles and more objects which
is lighter weight but still non-zero impact.
I believe this code was intended to have a lazy initialization but was just
not implemented correctly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]