zhangjun0x01 commented on a change in pull request #3122:
URL: https://github.com/apache/hudi/pull/3122#discussion_r655204771
##########
File path:
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java
##########
@@ -165,16 +165,20 @@
   Map<String, List<HoodieBaseFile>> groupedInputSplits = partitionsToParquetSplits.get(partitionPath).stream()
       .collect(Collectors.groupingBy(file -> FSUtils.getFileId(file.getFileStatus().getPath().getName())));
   latestFileSlices.forEach(fileSlice -> {
-    List<HoodieBaseFile> dataFileSplits = groupedInputSplits.get(fileSlice.getFileId());
-    dataFileSplits.forEach(split -> {
-      try {
-        List<String> logFilePaths = fileSlice.getLogFiles().sorted(HoodieLogFile.getLogFileComparator())
-            .map(logFile -> logFile.getPath().toString()).collect(Collectors.toList());
-        resultMap.put(split, logFilePaths);
-      } catch (Exception e) {
-        throw new HoodieException("Error creating hoodie real time split ", e);
-      }
-    });
+    final String fileId = fileSlice.getFileId();
+    // filter out the file group that has only logs (say the index is global).
+    if (groupedInputSplits.containsKey(fileId)) {
+      List<HoodieBaseFile> dataFileSplits = groupedInputSplits.get(fileId);
+      dataFileSplits.forEach(split -> {
+        try {
+          List<String> logFilePaths = fileSlice.getLogFiles().sorted(HoodieLogFile.getLogFileComparator())
+              .map(logFile -> logFile.getPath().toString()).collect(Collectors.toList());
+          resultMap.put(split, logFilePaths);
+        } catch (Exception e) {
+          throw new HoodieException("Error creating hoodie real time split ", e);
Review comment:
nit: Should we change the exception message so that, when an exception is
thrown here, it can be distinguished from the one thrown in the
`getRealtimeSplits` method?
Review comment:
I mean changing the message text, for example to 'Error creating hoodie real
time split for group logs by BaseFile', because the `groupLogsByBaseFile` and
`getRealtimeSplits` methods currently throw exceptions with the same message.
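
To illustrate the suggestion, here is a minimal sketch rather than the actual Hudi code. It assumes plain strings stand in for `HoodieBaseFile` and `HoodieLogFile`, a hypothetical `SketchException` stands in for `HoodieException`, and lexicographic sorting stands in for `HoodieLogFile.getLogFileComparator()`. It keeps the `containsKey` guard from the diff and uses the more specific message proposed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupLogsSketch {

  // Hypothetical stand-in for HoodieException so the sketch compiles without Hudi.
  static class SketchException extends RuntimeException {
    SketchException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Maps each base file to the sorted log files of its file group.
  // Both inputs are keyed by file id; the values are file names (assumed types).
  static Map<String, List<String>> groupLogsByBaseFile(
      Map<String, List<String>> baseFilesByFileId,
      Map<String, List<String>> logFilesByFileId) {
    Map<String, List<String>> resultMap = new HashMap<>();
    logFilesByFileId.forEach((fileId, logFiles) -> {
      // Skip file groups that have only logs and no base file.
      if (baseFilesByFileId.containsKey(fileId)) {
        baseFilesByFileId.get(fileId).forEach(baseFile -> {
          try {
            List<String> sortedLogs = new ArrayList<>(logFiles);
            Collections.sort(sortedLogs); // stand-in for the log file comparator
            resultMap.put(baseFile, sortedLogs);
          } catch (Exception e) {
            // Message names the calling method so it is not confused with
            // the identical message thrown from getRealtimeSplits.
            throw new SketchException(
                "Error creating hoodie real time split for group logs by BaseFile", e);
          }
        });
      }
    });
    return resultMap;
  }

  public static void main(String[] args) {
    Map<String, List<String>> baseFiles = new HashMap<>();
    baseFiles.put("f1", Arrays.asList("base-1.parquet"));

    Map<String, List<String>> logFiles = new HashMap<>();
    logFiles.put("f1", Arrays.asList("log-2", "log-1"));
    logFiles.put("f2", Arrays.asList("log-9")); // log-only group, skipped

    System.out.println(groupLogsByBaseFile(baseFiles, logFiles));
  }
}
```

With a distinct message per method, a stack trace or log line is enough to tell which of the two code paths failed.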