gyfora opened a new pull request #14174:
URL: https://github.com/apache/flink/pull/14174
## What is the purpose of the change
Fix a bug where DelimitedInputFormat does not correctly restore full-file
input splits. This occurs, for example, whenever compressed files are read.
The bug was caused by a faulty *if* condition, but fixing it uncovered
another problem with the open/reopen mechanism: reopen always called open,
which often led to backward seeks on the input split stream.
This happens because open already fills the read buffer in order to discard
a partial row at the beginning of a split, which can read the stream beyond
the checkpoint offset. reopen would then seek to the checkpoint offset
(potentially backward), refill the buffers, and start reading again.
To avoid this, I properly separated open and reopen in DelimitedInputFormat
and moved the initialization logic that subclasses had placed in the open
method into a common initializeSplit method.
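
The following is a minimal sketch of the seek problem and of the separation described above, not the code from this pull request. `ForwardOnlyStream`, `SketchFormat`, `reopenViaOpen` and the buffer size are hypothetical stand-ins; only the names open, reopen and initializeSplit come from the description, and the hook's parameters here are an assumption.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration; none of these types are Flink classes.
class ReopenSketch {

    /** A stream that can only move forward, as is typical for decompressing streams. */
    static final class ForwardOnlyStream {
        private final byte[] data;
        private int pos;

        ForwardOnlyStream(byte[] data) {
            this.data = data;
        }

        void seek(int target) throws IOException {
            if (target < pos) {
                throw new IOException("backward seek from " + pos + " to " + target);
            }
            pos = target;
        }

        /** Simulates filling a read buffer by consuming up to n bytes. */
        void fill(int n) {
            pos = Math.min(data.length, pos + n);
        }

        int position() {
            return pos;
        }
    }

    static class SketchFormat {
        static final int BUFFER_SIZE = 8; // assumed value, for the sketch only
        protected ForwardOnlyStream stream;

        /** Hook for per-split setup shared by open and reopen; offset is null on a fresh open. */
        protected void initializeSplit(byte[] split, Integer offset) {
            // subclasses would place their initialization logic here
        }

        void open(byte[] split) {
            stream = new ForwardOnlyStream(split);
            initializeSplit(split, null);
            stream.fill(BUFFER_SIZE); // pre-fill the buffer to skip a partial first row
        }

        /** Old shape: reopen delegates to open and only then seeks to the checkpoint offset. */
        void reopenViaOpen(byte[] split, int offset) throws IOException {
            open(split);
            stream.seek(offset); // fails when open already read past the offset
        }

        /** Separated shape: reopen shares only the hook, seeks first, then fills the buffer. */
        void reopen(byte[] split, int offset) throws IOException {
            stream = new ForwardOnlyStream(split);
            initializeSplit(split, offset);
            stream.seek(offset);
            stream.fill(BUFFER_SIZE);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] split = new byte[32];
        int checkpointOffset = 4;
        try {
            new SketchFormat().reopenViaOpen(split, checkpointOffset);
        } catch (IOException e) {
            System.out.println("old path failed: " + e.getMessage());
        }
        SketchFormat separated = new SketchFormat();
        separated.reopen(split, checkpointOffset);
        System.out.println("new path position: " + separated.stream.position());
    }
}
```

In this sketch the old path fails because open has already consumed 8 bytes before the seek to offset 4, while the separated path seeks first and only then fills the buffer.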
## Brief change log
- *Fix restore logic in DelimitedInputFormat#reopen for full file splits
(length == -1)*
- *Separate open/reopen in DelimitedInputFormat and refactor initialization
logic into another method*
- *Add new tests for coverage*
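
The description does not show the faulty condition itself, so the sketch below only illustrates, under assumptions, how a length of -1 can mislead an end-of-split check. The class, the method names and both conditions are hypothetical and are not taken from DelimitedInputFormat.

```java
// Hypothetical illustration; these checks are assumptions, not the Flink source.
class SplitEndCheckSketch {

    /** Marker length for a split that covers the whole file (length == -1). */
    static final long READ_WHOLE_SPLIT = -1L;

    /** Naive check: treats start + length as the split end even when length is the marker. */
    static boolean naivePastEnd(long start, long length, long offset) {
        return offset > start + length;
    }

    /** Guarded check: a full-file split has no known end, so it is never "past the end". */
    static boolean guardedPastEnd(long start, long length, long offset) {
        return length != READ_WHOLE_SPLIT && offset > start + length;
    }

    public static void main(String[] args) {
        // Restoring a full-file split at checkpoint offset 100:
        System.out.println(naivePastEnd(0, READ_WHOLE_SPLIT, 100));   // true: split wrongly treated as finished
        System.out.println(guardedPastEnd(0, READ_WHOLE_SPLIT, 100)); // false: reading resumes
    }
}
```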
## Verifying this change
This change added tests and can be verified as follows:
- *Extended CsvInputFormatTest to cover compressed files*
- *Extended TextInputFormatTest to cover compressed files*
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: yes (it changes the open/reopen methods but not their
signatures)
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no