codope commented on code in PR #9553:
URL: https://github.com/apache/hudi/pull/9553#discussion_r1314261872
##########
hudi-common/src/main/java/org/apache/hudi/common/data/HoodieListPairData.java:
##########
@@ -191,6 +191,30 @@ public <W> HoodiePairData<K, Pair<V, Option<W>>> leftOuterJoin(HoodiePairData<K,
return new HoodieListPairData<>(leftOuterJoined, lazy);
}
+ @Override
+ public <W> HoodiePairData<K, Pair<V, W>> join(HoodiePairData<K, W> other) {
+ ValidationUtils.checkArgument(other instanceof HoodieListPairData);
+
+ // Transform right-side container to a multi-map of [[K]] to [[List<W>]] values
+ HashMap<K, List<W>> rightStreamMap = ((HoodieListPairData<K, W>) other).asStream().collect(
+ Collectors.groupingBy(
+ Pair::getKey,
+ HashMap::new,
+ Collectors.mapping(Pair::getValue, Collectors.toList())));
Review Comment:
Here we convert the right side of the join (`other`) into a stream and
collect it into a `HashMap` (`rightStreamMap`). That map holds every key and
all of its associated values from the right side in memory, so a large `other`
dataset could lead to significant memory usage. Maybe only the keys need to be
held in memory, for a presence check.
Something like below:
```
public <W> HoodiePairData<K, Pair<V, W>> join(HoodiePairData<K, W> other) {
  ValidationUtils.checkArgument(other instanceof HoodieListPairData);

  // Transform right-side container to a multi-map of [[K]] to [[List<W>]] values
  Map<K, List<W>> rightStreamMap = ((HoodieListPairData<K, W>) other).asStream().collect(
      Collectors.groupingBy(
          Pair::getKey,
          Collectors.mapping(Pair::getValue, Collectors.toList())));

  // Inner join: emit one pair per matching right-side value; unmatched left keys are dropped
  List<Pair<K, Pair<V, W>>> joinResult = new ArrayList<>();
  asStream().forEach(pair -> {
    K key = pair.getKey();
    V leftValue = pair.getValue();
    List<W> rightValues = rightStreamMap.getOrDefault(key, Collections.emptyList());
    for (W rightValue : rightValues) {
      joinResult.add(Pair.of(key, Pair.of(leftValue, rightValue)));
    }
  });
  return new HoodieListPairData<>(joinResult.stream(), lazy);
}
```
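One possible reading of the "keys only, for a presence check" idea, as a rough sketch rather than a proposal for the actual patch: collect just the left-side keys into a set, and use it to drop right-side entries that can never match before grouping them. This sketch uses plain `java.util` types instead of `HoodiePairData` and `Pair`, and the class name `KeyFilteredJoin` is hypothetical. It only saves memory when many right-side keys have no match on the left, because the values for matching keys are still held.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-alone helper; not part of Hudi.
public class KeyFilteredJoin {

  // Inner join of two lists of key/value entries.
  public static <K, V, W> List<Map.Entry<K, Map.Entry<V, W>>> join(
      List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right) {
    // Keys only: used purely as a presence check against the right side.
    Set<K> leftKeys = left.stream().map(Map.Entry::getKey).collect(Collectors.toSet());

    // Group right-side values by key, skipping keys that have no left-side match.
    Map<K, List<W>> rightByKey = right.stream()
        .filter(e -> leftKeys.contains(e.getKey()))
        .collect(Collectors.groupingBy(
            Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

    // Emit one output entry per (left value, matching right value) combination.
    List<Map.Entry<K, Map.Entry<V, W>>> result = new ArrayList<>();
    for (Map.Entry<K, V> l : left) {
      for (W w : rightByKey.getOrDefault(l.getKey(), Collections.emptyList())) {
        Map.Entry<V, W> values = new SimpleImmutableEntry<>(l.getValue(), w);
        result.add(new SimpleImmutableEntry<>(l.getKey(), values));
      }
    }
    return result;
  }
}
```

The trade-off is an extra pass over the left side to build the key set, so whether this helps depends on how selective the join is.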