Simon Frei created FLINK-38411:
----------------------------------
Summary: Array index overflow in enumerator with continuous file source
Key: FLINK-38411
URL: https://issues.apache.org/jira/browse/FLINK-38411
Project: Flink
Issue Type: Bug
Components: Connectors / FileSystem
Reporter: Simon Frei
After running a standard file source with continuous monitoring for a while,
the following exception occurred:
{{ERROR}}
{{org.apache.flink.connector.file.src.impl.ContinuousFileSplitEnumerator}}
{{[] - Failed to enumerate files}}
{{java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for}}
{{length 10}}
{{at}}
{{org.apache.flink.connector.file.src.enumerate.NonSplittingRecursiveEnumerator.incrementCharArrayByOne(NonSplittingRecursiveEnumerator.java:149)}}
{{~[flink-connector-files-1.17.2.jar:1.17.2]}}
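For illustration, a minimal sketch of how a fixed-width decimal counter held in a char array can fail in exactly this way. This is a hypothetical stand-in written for this report, not code copied from Flink; the class and method names are assumptions. Once all ten digits are '9', the carry walks past the first position and reads index -1.

```java
final class CharCounterOverflow {

    /**
     * Increments the decimal number stored in {@code digits}, starting at {@code pos}
     * and carrying to the left. Hypothetical helper, assumed to resemble the pattern
     * named in the stack trace.
     */
    static void incrementByOne(char[] digits, int pos) {
        char c = digits[pos]; // fails here once the carry reaches pos == -1
        c++;
        if (c > '9') {
            c = '0';
            incrementByOne(digits, pos - 1); // carry into the next digit to the left
        }
        digits[pos] = c;
    }

    public static void main(String[] args) {
        char[] id = "9999999999".toCharArray(); // the last ID a 10-digit buffer can hold
        try {
            incrementByOne(id, id.length - 1);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("overflow: " + e.getMessage());
        }
    }
}
```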
The problem has existed since this code was first introduced in
https://github.com/apache/flink/pull/13401 and is still present on master:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L56
The following comment is above it:
{{ /**}}
{{ * The current Id as a mutable string representation. This covers more values than the integer}}
{{ * value range, so we should never overflow.}}
{{ */}}
While it's true that a 10-digit decimal counter covers more values than an `int`,
I don't see why that would be considered sufficient. For an ID/counter use case
like this it is simply not enough, and overflow can easily occur. For example, in
this real deployment one file is added per minute and we monitor with a 2-minute
interval. With 6 months of history it takes only 54 days to overflow.
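A rough sketch of that arithmetic, under assumptions that are mine: every scan consumes one ID per file it finds (old and new alike), 6 months is taken as 180 days, and growth of the directory during the run is ignored. The class and method names are made up for this example.

```java
final class IdOverflowEstimate {

    /** Days until a decimal counter with {@code digits} digits runs out of IDs. */
    static double daysUntilOverflow(int digits, long filesPerScan, long scansPerDay) {
        double capacity = Math.pow(10, digits) - 1;          // largest ID the buffer can hold
        double idsPerDay = (double) filesPerScan * scansPerDay;
        return capacity / idsPerDay;
    }

    public static void main(String[] args) {
        long filesPerScan = 180L * 24 * 60; // one file per minute for ~6 months = 259,200 files
        long scansPerDay = 24 * 60 / 2;     // one scan every 2 minutes = 720 scans per day
        // 259,200 * 720 = 186,624,000 IDs per day, so the 10-digit range lasts just under 54 days.
        System.out.println(daysUntilOverflow(10, filesPerScan, scansPerDay));
    }
}
```

With 19 digits instead of 10, the same workload would take on the order of 10^10 days.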
Imo this is generally a case of premature optimisation: converting a long to a
string here would hardly be relevant compared to all the filesystem
interactions and IO that happen. And if it were, that would be something for a
benchmark to show. If you'd like to keep that pattern, I'd propose increasing
the size to cover the range of a long instead. Happy to provide a PR for either
option; let me know if you have a preference, otherwise I'll just do the former
soon-ish.
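A minimal sketch of what the first option could look like, assuming a plain long counter formatted on demand. The class name, the start-value constructor, and the choice to keep zero-padding to 10 digits are my assumptions, not existing Flink code.

```java
import java.util.Locale;

final class LongIdCounter {

    private long currentId;

    LongIdCounter() {
        this(0L);
    }

    /** Start value is exposed only to make the wide-range behaviour easy to try out. */
    LongIdCounter(long start) {
        this.currentId = start;
    }

    /** Returns the next ID, zero-padded to at least 10 digits (padding is an assumption). */
    String nextId() {
        currentId++;
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "%010d", currentId);
    }

    public static void main(String[] args) {
        System.out.println(new LongIdCounter().nextId());
        // Past the old 10-digit limit the ID simply grows wider instead of failing.
        System.out.println(new LongIdCounter(9_999_999_999L).nextId());
    }
}
```

For the second option, a long needs at most 19 decimal digits (Long.MAX_VALUE is 9223372036854775807), so the existing buffer would grow from 10 to 19 characters.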