bhat-vinay opened a new pull request, #10865:
URL: https://github.com/apache/hudi/pull/10865

   The incremental source (for S3 and GCS) performs an existence check for the 
object (in the object store), which can be an expensive operation. The check can 
be replaced with two Spark session configs: `spark.sql.files.ignoreMissingFiles` 
and `spark.sql.files.ignoreCorruptFiles`. Both are documented at 
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files.
   
   Of specific interest in the documentation is this line: "Here, missing 
file really means the deleted file under directory after you construct the 
DataFrame." This seems to be the case covered by the existence check in 
`CloudObjectsSelectorCommon.java`. Note that if someone wants to check for 
the existence of a file while the DataFrame itself is being created, then these 
configs will not work.
   
   Added a test case that simulates file deletions _after_ the DataFrame is 
created, in `TestCloudObjectsSelectorCommon::ignoreMissingFiles`.
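
   For readers unfamiliar with the semantics, here is a minimal plain-Java 
analogy of "file deleted after the listing was taken". It is not Spark code and 
not the test added in this PR: the class `MissingFileSimulation`, its methods, 
and the `ignoreMissing` flag are all hypothetical names, and the captured list 
of paths only stands in for the file listing a DataFrame holds once it has been 
constructed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Hudi or Spark.
final class MissingFileSimulation {

    // Reads every file in a previously captured listing. A file that vanished
    // after the listing was taken is skipped when ignoreMissing is true and
    // fails the read otherwise.
    static List<String> readAll(List<Path> listing, boolean ignoreMissing) {
        List<String> lines = new ArrayList<>();
        for (Path file : listing) {
            try {
                lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            } catch (NoSuchFileException e) {
                if (!ignoreMissing) {
                    throw new UncheckedIOException(e);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return lines;
    }

    // Writes two files, captures the listing, deletes one file, then reads.
    static List<String> simulate(boolean ignoreMissing) {
        try {
            Path dir = Files.createTempDirectory("missing-files");
            Path kept = Files.write(dir.resolve("a.txt"), "alpha".getBytes(StandardCharsets.UTF_8));
            Path deleted = Files.write(dir.resolve("b.txt"), "beta".getBytes(StandardCharsets.UTF_8));
            List<Path> listing = Arrays.asList(kept, deleted);
            Files.delete(deleted); // deletion happens after the listing exists
            return readAll(listing, ignoreMissing);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));
    }
}
```

   With `ignoreMissing` set to `false`, the same scenario fails with a 
`NoSuchFileException` wrapped in an `UncheckedIOException`, which is the kind 
of failure the up-front existence check was guarding against.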
   
   ### Change Logs
   GcsEventsHoodieIncrSource.java and S3EventsHoodieIncrSource.java:
   Set the two Spark options in the sparkSession iff `ENABLE_EXISTS_CHECK` is 
set in the properties. The two options are `spark.sql.files.ignoreMissingFiles` 
and `spark.sql.files.ignoreCorruptFiles`.
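
   A minimal sketch of the gating described above, assuming the flag has 
already been read from the properties as a boolean. The class 
`MissingFileSparkOptions` and the method `optionsFor` are hypothetical names 
for illustration and are not taken from the PR's diff; only the two config keys 
come from the description.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the PR's actual code is not reproduced here.
final class MissingFileSparkOptions {
    static final String IGNORE_MISSING_FILES = "spark.sql.files.ignoreMissingFiles";
    static final String IGNORE_CORRUPT_FILES = "spark.sql.files.ignoreCorruptFiles";

    // Returns the session options to apply; empty when the flag is not set.
    static Map<String, String> optionsFor(boolean existsCheckEnabled) {
        if (!existsCheckEnabled) {
            return Collections.emptyMap();
        }
        Map<String, String> options = new LinkedHashMap<>();
        options.put(IGNORE_MISSING_FILES, "true");
        options.put(IGNORE_CORRUPT_FILES, "true");
        return options;
    }

    public static void main(String[] args) {
        // With a live SparkSession (an assumed variable named `spark`), each
        // entry could be applied with spark.conf().set(key, value).
        optionsFor(true).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

   Keeping the keys in one place means both incremental sources would set the 
same pair of options, which is the behaviour the change log describes.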
   
   CloudObjectsSelectorCommon.java:
   Remove the call to `checkIfFileExists(...)` when building a URL for a file
   
   IncrSourceCloudStorageHelper.java:
   Set the same Spark options in the sparkSession. These functions do not seem 
to be used anywhere, but the file existence check is cleaned up here as well 
for consistency.
   
   TestCloudObjectsSelectorCommon.java: New test case for this functionality.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   Low. New tests and existing tests should suffice.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
