StephanEwen commented on pull request #13574: URL: https://github.com/apache/flink/pull/13574#issuecomment-718210093
@stevenzwu You are right, this tradeoff exists. It exists in lots of places in Flink (and I believe in other systems as well). Either you have synchronous error reporting on job submission, or you support long initialization phases. Flink has generally moved to supporting longer initialization phases, because they happen all the time (lots of files to enumerate, blocking connections to S3 / Kafka, etc.). The CLI and the SQL client switch to status polling immediately after submitting the job, so they still report errors fast.

File enumeration already happens asynchronously to job submission in the current code, because the whole execution graph construction and job initialization is already asynchronous to the job submission. At least it is in Flink 1.12. That change was also made with state backend initialization, savepoint loading, etc. in mind. So if these parts take long, they no longer lead to a request timeout for the job submission. But it does mean that some errors are no longer returned on the "submit job" call, only on a later status poll. Tradeoffs :-/

Moving file enumeration into the `SplitEnumerator` and doing it asynchronously there would be totally fine; you get behavior similar to what you have now. There is an added benefit: the job starts scheduling tasks faster. Enumerators are initialized as part of the execution graph initialization, and task scheduling starts only after that, so a faster enumerator start means a faster graph initialization.
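
To make the tradeoff concrete, here is a minimal sketch in plain Java, under stated assumptions. `AsyncFileEnumerator`, `slowListing`, and the status strings are hypothetical names for illustration only; this is not Flink's `SplitEnumerator` interface. It shows `start()` returning immediately while the listing runs in the background, so a listing failure only becomes visible on a later status poll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Hypothetical stand-in for an enumerator; NOT Flink's SplitEnumerator API. */
class AsyncFileEnumerator {
    // Assumption: the (possibly slow) file listing is modeled as a plain Supplier.
    private final Supplier<List<String>> slowListing;
    private final List<String> discoveredSplits = new CopyOnWriteArrayList<>();
    private volatile Throwable failure;
    private volatile CompletableFuture<Void> enumeration;

    AsyncFileEnumerator(Supplier<List<String>> slowListing) {
        this.slowListing = slowListing;
    }

    /** Returns immediately; the listing runs on a background thread. */
    void start() {
        enumeration = CompletableFuture.supplyAsync(slowListing)
                .<Void>handle((files, error) -> {
                    if (error != null) {
                        // supplyAsync may wrap the original exception; unwrap it.
                        failure = (error instanceof CompletionException && error.getCause() != null)
                                ? error.getCause()
                                : error;
                    } else {
                        discoveredSplits.addAll(files);
                    }
                    return null;
                });
    }

    /** What a status poll would see: errors surface here, not in start(). */
    String status() {
        if (failure != null) {
            return "FAILED: " + failure.getMessage();
        }
        CompletableFuture<Void> f = enumeration;
        return (f != null && f.isDone()) ? "RUNNING" : "INITIALIZING";
    }

    /** Snapshot of the splits discovered so far. */
    List<String> splits() {
        return new ArrayList<>(discoveredSplits);
    }

    /** Blocks until the background listing has finished (used only for the demo). */
    void awaitInitialization() {
        enumeration.join();
    }
}
```

A caller would invoke `start()`, get control back right away, and then poll `status()` until it leaves `INITIALIZING`. In an actual Flink source, the background work would more likely be handed to `SplitEnumeratorContext#callAsync` than to a raw `CompletableFuture`; the sketch only illustrates the error-reporting consequence.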
