bhat-vinay opened a new pull request, #10865: URL: https://github.com/apache/hudi/pull/10865
The incremental source (for S3 and GCS) performs an existence check for each object in the object store, which can be an expensive operation. The check can be replaced with two Spark session configs: `spark.sql.files.ignoreMissingFiles` and `spark.sql.files.ignoreCorruptFiles`. They are documented at https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files.

Of specific interest in the documentation is this line: "Here, missing file really means the deleted file under directory after you construct the DataFrame." This seems to match the case handled by the existence check in `CloudObjectsSelectorCommon.java`. Note the limitation: if someone wants to check for the existence of a file while the DataFrame itself is being created, these options will not cover that case.

Added a test case that simulates file deletions _after_ the DataFrame is created, in `TestCloudObjectsSelectorCommon::ignoreMissingFiles`.

### Change Logs

- `GcsEventsHoodieIncrSource.java` and `S3EventsHoodieIncrSource.java`: Set the two Spark options in the sparkSession iff `ENABLE_EXISTS_CHECK` is set in the properties. The two options are `spark.sql.files.ignoreMissingFiles` and `spark.sql.files.ignoreCorruptFiles`.
- `CloudObjectsSelectorCommon.java`: Remove the call to `checkIfFileExists(...)` when building the URL for a file.
- `IncrSourceCloudStorageHelper.java`: Set the same Spark options in the sparkSession. These functions do not seem to be used anywhere, but the file-existence check is cleaned up here regardless.
- `TestCloudObjectsSelectorCommon.java`: New test case for this functionality.

### Impact

None

### Risk level (write none, low medium or high below)

Low. New tests and existing tests should suffice.
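The option-setting step described under Change Logs can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code in this PR: the class name `MissingFileOptions`, the method `sparkOptionsFor`, and the property key `example.source.enable.exists.check` are all hypothetical stand-ins (the actual key behind `ENABLE_EXISTS_CHECK` is not given in this description). Only the two `spark.sql.files.*` option names come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

final class MissingFileOptions {
    // Hypothetical property key, used for illustration only. It stands in for
    // whatever key ENABLE_EXISTS_CHECK maps to in the real configuration.
    static final String EXISTS_CHECK_KEY = "example.source.enable.exists.check";

    // The two Spark SQL options named in the PR description.
    static final String IGNORE_MISSING = "spark.sql.files.ignoreMissingFiles";
    static final String IGNORE_CORRUPT = "spark.sql.files.ignoreCorruptFiles";

    // Returns the Spark options to apply: both options set to "true" when the
    // exists-check property is enabled, and an empty map otherwise.
    static Map<String, String> sparkOptionsFor(Properties props) {
        Map<String, String> options = new LinkedHashMap<>();
        boolean enabled = Boolean.parseBoolean(props.getProperty(EXISTS_CHECK_KEY, "false"));
        if (enabled) {
            options.put(IGNORE_MISSING, "true");
            options.put(IGNORE_CORRUPT, "true");
        }
        return options;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(EXISTS_CHECK_KEY, "true");
        // Print the options that would be handed to the Spark session.
        sparkOptionsFor(props).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With a real session, each returned entry would presumably be applied with something like `sparkSession.conf().set(key, value)` before the DataFrame is constructed. Keeping the decision in a small pure function makes the "iff the property is set" behavior easy to unit test without starting Spark.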
### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
