[ 
https://issues.apache.org/jira/browse/HUDI-9475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moran updated HUDI-9475:
------------------------
    Summary: Fix invalid log file name exception during CDC query  (was: Fix 
Invalid log file name exception during CDC query)

> Fix invalid log file name exception during CDC query
> ----------------------------------------------------
>
>                 Key: HUDI-9475
>                 URL: https://issues.apache.org/jira/browse/HUDI-9475
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cdc
>            Reporter: Moran
>            Priority: Major
>
> While using Hudi's Change Data Capture (CDC) feature, we encountered the 
> following exception:
> {code:java}
> org.apache.hudi.exception.HoodieIOException: Invalid log file name: 
> city=shanghai/.c25b1849-323e-499f-90d2-341936a91b37-4_20250418115102802.log.2_61444-17-491101.cdc
>       at 
> org.apache.hudi.common.fs.FSUtils.getFileVersionFromLog(FSUtils.java:433)
>       at 
> java.util.Comparator.lambda$comparingInt$7b0bb60$1(Comparator.java:490)
>       at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>       at java.util.TimSort.sort(TimSort.java:220)
>       at java.util.Arrays.sort(Arrays.java:1512)
>       at 
> java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:348)
>       at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>       at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>       at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>       at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>       at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>       at 
> org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:100)
>       at 
> org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:79)
>       at 
> org.apache.hudi.common.table.cdc.HoodieCDCExtractor.parseWriteStat(HoodieCDCExtractor.java:294)
>       at org.apache.hudi.common.table.cdc.HoodieCDCExtractor.lambda{code}
> The error originates from {*}getFileVersionFromLog(String 
> logFileName){*}, which expects a bare log file name without a path. However, 
> the HoodieCDCFileSplit constructor passes full relative paths (including 
> partition directories) directly to this method during sorting:
> {code:java}
> this.cdcFiles = cdcFiles.stream()
>     .sorted(Comparator.comparingInt(FSUtils::getFileVersionFromLog))
>     .collect(Collectors.toList()); {code}
> Since the input includes the partition path (e.g., 
> city=shanghai/.<log_file_name>.cdc), the regex in getFileVersionFromLog() 
> fails to match, resulting in a HoodieIOException.
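A minimal sketch of the mismatch described above, and of one possible way around it (stripping the partition directory before parsing). The class name, the method names, and the regex are hypothetical stand-ins written for illustration; they are not Hudi's actual pattern or API, and the real fix may look different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogVersionSketch {
    // Simplified, hypothetical pattern for names shaped like
    // .<fileId>_<instant>.log.<version>[_<writeToken>][.cdc]
    // It is anchored at the leading dot, so any directory prefix breaks the match.
    static final Pattern LOG_NAME =
        Pattern.compile("^\\.(.+)_(.*)\\.log\\.(\\d+)(_(\\d+-\\d+-\\d+))?(\\.cdc)?$");

    // Expects a bare file name; rejects anything that includes a directory part.
    static int versionFromName(String logFileName) {
        Matcher m = LOG_NAME.matcher(logFileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid log file name: " + logFileName);
        }
        return Integer.parseInt(m.group(3));
    }

    // Drops everything up to the last '/' so only the file name is parsed.
    static int versionFromPath(String relativePath) {
        String name = relativePath.substring(relativePath.lastIndexOf('/') + 1);
        return versionFromName(name);
    }

    public static void main(String[] args) {
        String path = "city=shanghai/"
            + ".c25b1849-323e-499f-90d2-341936a91b37-4_20250418115102802.log.2_61444-17-491101.cdc";
        System.out.println(versionFromPath(path)); // parses the name after the last '/'
        try {
            versionFromName(path); // the full relative path does not start with '.'
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the comparator in the constructor would sort on something like `versionFromPath` instead of passing the raw path to the name-only parser.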
>  
> Importantly, {*}this issue only occurs when a file group contains multiple 
> CDC log files{*}, since the sort comparator is only invoked in that case. If 
> there's only one CDC file, nothing is compared and the error does not occur.
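The single-file versus multi-file behavior can be checked with plain JDK code. The sketch below counts how often a comparator runs when a stream is sorted; the class and method names are hypothetical. That a one-element sort never calls the comparator is how the JDK's TimSort-based implementation behaves (it matches the stack trace above), not something the Stream API documentation promises.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

class SortTriggerSketch {
    // Sorts the given names via a stream and returns how many times
    // the comparator was actually invoked.
    static int comparatorCalls(List<String> files) {
        AtomicInteger calls = new AtomicInteger();
        files.stream()
            .sorted((a, b) -> {
                calls.incrementAndGet();
                return a.compareTo(b);
            })
            .collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        // One element: there is nothing to compare.
        System.out.println(comparatorCalls(Collections.singletonList("a.cdc")));
        // Two elements: the comparator runs, so a throwing key extractor would fail here.
        System.out.println(comparatorCalls(Arrays.asList("b.cdc", "a.cdc")));
    }
}
```

This is why a comparator whose key extractor throws on path-prefixed names only surfaces once a file group has at least two CDC log files.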



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
