[
https://issues.apache.org/jira/browse/HUDI-9475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Moran updated HUDI-9475:
------------------------
Summary: Fix invalid log file name exception during CDC query (was: Fix
Invalid log file name exception during CDC query)
> Fix invalid log file name exception during CDC query
> ----------------------------------------------------
>
> Key: HUDI-9475
> URL: https://issues.apache.org/jira/browse/HUDI-9475
> Project: Apache Hudi
> Issue Type: Bug
> Components: cdc
> Reporter: Moran
> Priority: Major
>
> While using Hudi's Change Data Capture (CDC) feature, we encountered the
> following exception:
> {code:java}
> org.apache.hudi.exception.HoodieIOException: Invalid log file name:
> city=shanghai/.c25b1849-323e-499f-90d2-341936a91b37-4_20250418115102802.log.2_61444-17-491101.cdc
> at
> org.apache.hudi.common.fs.FSUtils.getFileVersionFromLog(FSUtils.java:433)
> at
> java.util.Comparator.lambda$comparingInt$7b0bb60$1(Comparator.java:490)
> at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
> at java.util.TimSort.sort(TimSort.java:220)
> at java.util.Arrays.sort(Arrays.java:1512)
> at
> java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:348)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> at
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at
> org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:100)
> at
> org.apache.hudi.common.table.cdc.HoodieCDCFileSplit.<init>(HoodieCDCFileSplit.java:79)
> at
> org.apache.hudi.common.table.cdc.HoodieCDCExtractor.parseWriteStat(HoodieCDCExtractor.java:294)
> at org.apache.hudi.common.table.cdc.HoodieCDCExtractor.lambda{code}
> The error originates from {*}getFileVersionFromLog(String
> logFileName){*}, which expects a bare log file name without a path. However,
> the HoodieCDCFileSplit constructor passes full paths (including partition
> directories) directly to this method when sorting:
> {code:java}
> this.cdcFiles = cdcFiles.stream()
> .sorted(Comparator.comparingInt(FSUtils::getFileVersionFromLog))
> .collect(Collectors.toList()); {code}
> Since the input includes the full partition path (e.g.,
> city=shanghai/.<log_file_name>.cdc), the regex in getFileVersionFromLog()
> does not match, resulting in a HoodieIOException.
>
> Importantly, {*}this issue only occurs when a file group contains multiple
> CDC log files{*}, since the sort operation is only triggered in that case. If
> there's only one CDC file, the sort is skipped and the error does not occur.
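
The behaviour described above can be reproduced outside Hudi with a small standalone sketch. It is only an illustration under stated assumptions, not the actual patch. The class CdcLogVersionSketch, its methods, and the simplified regex are hypothetical stand-ins for FSUtils.getFileVersionFromLog and Hudi's real log file pattern, and the version-1 file name is invented to give the sort a second element. The sketch strips the partition directories before parsing, which is one possible way to make the comparator safe.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CdcLogVersionSketch {
    // ASSUMPTION: simplified stand-in pattern, not Hudi's actual regex.
    // Shape: .<fileId>_<instantTime>.log.<version>[_<writeToken>][.cdc]
    private static final Pattern LOG_NAME =
            Pattern.compile("^\\.(.+)_(\\d+)\\.log\\.(\\d+)(_[^.]+)?(\\.cdc)?$");

    // Parses the version from a bare file name; any path prefix is rejected.
    static int versionFromName(String logFileName) {
        Matcher m = LOG_NAME.matcher(logFileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid log file name: " + logFileName);
        }
        return Integer.parseInt(m.group(3));
    }

    // Hypothetical helper: drop the partition directories, then parse.
    static int versionFromPath(String relativePath) {
        String name = relativePath.substring(relativePath.lastIndexOf('/') + 1);
        return versionFromName(name);
    }

    // Mirrors the failing sort: the raw path reaches the parser.
    static List<String> sortByRawPath(List<String> cdcFiles) {
        return cdcFiles.stream()
                .sorted(Comparator.comparingInt(CdcLogVersionSketch::versionFromName))
                .collect(Collectors.toList());
    }

    // Possible fix: compare on the file name only.
    static List<String> sortByFileName(List<String> cdcFiles) {
        return cdcFiles.stream()
                .sorted(Comparator.comparingInt(CdcLogVersionSketch::versionFromPath))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String prefix =
                "city=shanghai/.c25b1849-323e-499f-90d2-341936a91b37-4_20250418115102802.log.";
        String v2 = prefix + "2_61444-17-491101.cdc"; // path taken from the report
        String v1 = prefix + "1_61444-17-491101.cdc"; // hypothetical sibling file

        // Sorting on the file name orders the two files by version.
        System.out.println(sortByFileName(Arrays.asList(v2, v1)));

        // A single element is returned without the comparator being invoked.
        System.out.println(sortByRawPath(Arrays.asList(v2)));

        // Two elements force a comparison, so the path-prefixed name is parsed and rejected.
        try {
            sortByRawPath(Arrays.asList(v2, v1));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the single-file case passes because the sort has nothing to compare, so the key extractor never runs. That is consistent with the report that the exception only appears once a file group holds more than one CDC log file.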