Moran created HUDI-9475:
---------------------------
Summary: Fix HoodieIOException: Invalid log file name during CDC
query with multiple CDC files in a file group
Key: HUDI-9475
URL: https://issues.apache.org/jira/browse/HUDI-9475
Project: Apache Hudi
Issue Type: Bug
Components: cdc
Reporter: Moran
While using Hudi's Change Data Capture (CDC) feature, we encountered the
following exception:
{code:java}
org.apache.hudi.exception.HoodieIOException: Invalid log file name: city=shanghai/.c25b1849-323e-499f-90d2-341936a91b37-4_20250418115102802.log.2_61444-17-491101.cdc
    at org.apache.hudi.common.fs.FSUtils.getFileVersionFromLog(FSUtils.java:433)
    at java.util.Comparator.lambda$comparingInt$7b0bb60$1(Comparator.java:490)
    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
    at java.util.TimSort.sort(TimSort.java:220)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:348)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:100)
    at org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:79)
    at org.apache.hudi.common.table.cdc.HoodieCDCExtractor.parseWriteStat(HoodieCDCExtractor.java:294)
    at org.apache.hudi.common.table.cdc.HoodieCDCExtractor.lambda
{code}
The error originates from {*}getFileVersionFromLog(String logFileName){*}, which
expects a log file name without a path. However, in the HoodieCDCFileSplit
constructor, full paths (including partition directories) are passed directly
to this method during sorting:
{code:java}
this.cdcFiles = cdcFiles.stream()
    .sorted(Comparator.comparingInt(FSUtils::getFileVersionFromLog))
    .collect(Collectors.toList());
{code}
Since the input includes the partition path (e.g.,
city=shanghai/.<log_file_name>.cdc), the regex in getFileVersionFromLog()
fails to match, resulting in a HoodieIOException.
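The mismatch can be reproduced outside Hudi with a small sketch. The pattern below is a simplified stand-in for Hudi's log file regex, and versionOf and baseName are hypothetical helpers introduced only for illustration. They are not the actual FSUtils code or the proposed patch. The sketch assumes the real pattern is anchored on the leading dot of the file name, which is consistent with the failure above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogNameDemo {
    // ASSUMPTION: simplified stand-in for Hudi's log file pattern, anchored on the leading '.'.
    static final Pattern LOG_FILE_PATTERN =
        Pattern.compile("^\\.(.+)_(.*)\\.log\\.(\\d+)(_\\d+-\\d+-\\d+)?(\\.cdc)?$");

    // Hypothetical helper: extracts the log version, mimicking the failure mode on a mismatch.
    static int versionOf(String logFileName) {
        Matcher m = LOG_FILE_PATTERN.matcher(logFileName);
        if (!m.find()) {
            throw new IllegalArgumentException("Invalid log file name: " + logFileName);
        }
        return Integer.parseInt(m.group(3));
    }

    // Hypothetical helper: drops any directory prefix such as "city=shanghai/".
    static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String full = "city=shanghai/.c25b1849-323e-499f-90d2-341936a91b37-4"
            + "_20250418115102802.log.2_61444-17-491101.cdc";
        // Stripping the partition directory first lets the pattern match.
        System.out.println(versionOf(baseName(full)));
        try {
            // Passing the full relative path fails because the name no longer starts with '.'.
            versionOf(full);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, one possible direction for a fix is to reduce each CDC path to its file name before extracting the version in the comparator.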
Importantly, {*}this issue only occurs when a file group contains multiple CDC
log files{*}, since the comparator is only invoked when there are at least two
elements to sort. If there's only one CDC file, getFileVersionFromLog() is never
called and the error does not occur.
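The single-file behaviour can be checked with a standalone sketch that counts how often the sort key extractor runs. SortInvocationDemo and keyCalls are hypothetical names used only for this illustration, and the string length stands in for the log version. The sketch relies only on the JDK, where sorting fewer than two elements returns without calling the comparator.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

final class SortInvocationDemo {
    // Hypothetical helper: returns how many times the key extractor ran while sorting.
    static int keyCalls(List<String> names) {
        AtomicInteger calls = new AtomicInteger();
        names.stream()
            .sorted(Comparator.comparingInt((String s) -> {
                calls.incrementAndGet(); // stands in for FSUtils::getFileVersionFromLog
                return s.length();
            }))
            .collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        // One element: nothing to compare, so the key extractor never runs.
        System.out.println(keyCalls(Arrays.asList("a")));
        // Two elements: the comparator runs, which is where the exception surfaces.
        System.out.println(keyCalls(Arrays.asList("bb", "a")));
    }
}
```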
--
This message was sent by Atlassian Jira
(v8.20.10#820010)