Moran created HUDI-9475:
---------------------------

             Summary: Fix HoodieIOException: Invalid log file name during CDC 
query with multiple CDC files in a file group
                 Key: HUDI-9475
                 URL: https://issues.apache.org/jira/browse/HUDI-9475
             Project: Apache Hudi
          Issue Type: Bug
          Components: cdc
            Reporter: Moran


While using Hudi's Change Data Capture (CDC) feature, we encountered the 
following exception:
{code:java}
org.apache.hudi.exception.HoodieIOException: Invalid log file name: 
city=shanghai/.c25b1849-323e-499f-90d2-341936a91b37-4_20250418115102802.log.2_61444-17-491101.cdc
        at 
org.apache.hudi.common.fs.FSUtils.getFileVersionFromLog(FSUtils.java:433)
        at 
java.util.Comparator.lambda$comparingInt$7b0bb60$1(Comparator.java:490)
        at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
        at java.util.TimSort.sort(TimSort.java:220)
        at java.util.Arrays.sort(Arrays.java:1512)
        at 
java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:348)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at 
org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:100)
        at 
org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:79)
        at 
org.apache.hudi.common.table.cdc.HoodieCDCExtractor.parseWriteStat(HoodieCDCExtractor.java:294)
        at org.apache.hudi.common.table.cdc.HoodieCDCExtractor.lambda{code}
The error originates from {*}getFileVersionFromLog(String logFileName){*}, which 
expects a bare log file name with no path prefix. However, the HoodieCDCFileSplit 
constructor passes paths that include the partition directory directly 
to this method when sorting:
{code:java}
this.cdcFiles = cdcFiles.stream()
    .sorted(Comparator.comparingInt(FSUtils::getFileVersionFromLog))
    .collect(Collectors.toList()); {code}
Because each input includes the partition path (e.g., 
city=shanghai/.<log_file_name>.cdc), the regex in getFileVersionFromLog() 
does not match, and the method throws a HoodieIOException.
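
One possible direction for a fix is to reduce each path to its file name before extracting the version. The following is a minimal, self-contained sketch of that idea, not the actual Hudi code: the class name, the helper names (versionFromLogFileName, fileNameOf, sortByVersion), and the simplified regex are all assumptions made for illustration. The real pattern used by FSUtils is more detailed, and the sketch throws IllegalArgumentException where Hudi throws HoodieIOException.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CdcLogFileSortSketch {
    // Hypothetical, simplified stand-in for Hudi's log file pattern.
    // It must match the whole input and requires a leading '.',
    // so a partition directory prefix makes the match fail.
    private static final Pattern LOG_FILE =
        Pattern.compile("^\\.(.+)_(.*)\\.log\\.(\\d+)(_.*)?$");

    // Mimics the contract described above: accepts a bare file name only.
    static int versionFromLogFileName(String logFileName) {
        Matcher m = LOG_FILE.matcher(logFileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid log file name: " + logFileName);
        }
        return Integer.parseInt(m.group(3));
    }

    // Drops everything up to and including the last '/', if any.
    static String fileNameOf(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Sorts by log version, stripping the partition path before parsing.
    static List<String> sortByVersion(List<String> cdcPaths) {
        return cdcPaths.stream()
            .sorted(Comparator.comparingInt(
                (String p) -> versionFromLogFileName(fileNameOf(p))))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "city=shanghai/.f1-0_20250418115102802.log.2_1-2-3.cdc",
            "city=shanghai/.f1-0_20250418115102802.log.1_1-2-3.cdc");
        // Prints the two paths ordered by log version.
        sortByVersion(paths).forEach(System.out::println);
    }
}
```

In this sketch the sorted list still holds the original paths; only the comparator's sort key is derived from the bare file name.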
 
Importantly, {*}this issue only occurs when a file group contains multiple CDC 
log files{*}, because the comparator is only invoked when there are at least two 
elements to order. If there's only one CDC file, getFileVersionFromLog() is 
never called and the error does not occur.
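
The single-file behavior can be checked in isolation. The sketch below is illustrative only and does not use any Hudi classes; the class and method names are hypothetical. It counts how often the key extractor passed to Comparator.comparingInt runs. On OpenJDK, sorting a one-element stream returns before any comparison is made, which is why a malformed name goes unnoticed until a second CDC file appears.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

class SingleElementSortSketch {
    // Returns how many times the key extractor ran while sorting items.
    static int keyExtractorCalls(List<String> items) {
        AtomicInteger calls = new AtomicInteger();
        items.stream()
            .sorted(Comparator.comparingInt((String s) -> {
                calls.incrementAndGet();
                return s.length();
            }))
            .collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        // One element: nothing to compare, so the extractor is not invoked.
        System.out.println(keyExtractorCalls(Collections.singletonList("a")));
        // Two elements: at least one comparison takes place.
        System.out.println(keyExtractorCalls(Arrays.asList("bb", "a")) > 0);
    }
}
```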


