Re: [PR] [HUDI-7504] replace expensive existence check with spark options [hudi]

via GitHub Mon, 18 Mar 2024 12:59:59 -0700


yihua commented on code in PR #10865:
URL: https://github.com/apache/hudi/pull/10865#discussion_r1529188995



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java:
##########
@@ -112,10 +110,15 @@ public S3EventsHoodieIncrSource(
       QueryRunner queryRunner,
       CloudDataFetcher cloudDataFetcher) {
     super(props, sparkContext, sparkSession, schemaProvider);
+
+    if (getBooleanWithAltKeys(props, ENABLE_EXISTS_CHECK)) {
+      sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
+      sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");

Review Comment:
   See the Spark docs:
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files
 `Spark allows you to use the configuration spark.sql.files.ignoreMissingFiles 
or the data source option ignoreMissingFiles to ignore missing files while 
reading data from files.`
   
   You need to set `.option("ignoreMissingFiles", "true")` on the reader to achieve this behavior.
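
A minimal sketch of the per-read approach, assuming the existence-check flag is already resolved to a boolean. `ReaderOptionsSketch` and `build` are hypothetical names for illustration and are not part of Hudi. The `ignoreCorruptFiles` data source option is assumed to mirror `spark.sql.files.ignoreCorruptFiles`, as the same Spark docs page describes. The resulting map would be handed to `DataFrameReader.options(java.util.Map)`, shown here only in a comment so the block has no Spark dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects per-read data source options instead of
// mutating session-wide Spark SQL configuration.
final class ReaderOptionsSketch {

  private ReaderOptionsSketch() {
  }

  // Returns the options to apply to a single read. When skipMissingOrCorrupt
  // is false the map is empty, so Spark's defaults stay in effect.
  static Map<String, String> build(boolean skipMissingOrCorrupt) {
    Map<String, String> options = new LinkedHashMap<>();
    if (skipMissingOrCorrupt) {
      // Data source option named in the Spark docs quoted above.
      options.put("ignoreMissingFiles", "true");
      // Assumed counterpart of spark.sql.files.ignoreCorruptFiles.
      options.put("ignoreCorruptFiles", "true");
    }
    return options;
  }

  // Intended use with Spark (not compiled here; format and paths are placeholders):
  //   sparkSession.read()
  //       .format("parquet")
  //       .options(ReaderOptionsSketch.build(true))
  //       .load(paths);
}
```

Scoping the options to the reader leaves other reads on the same `SparkSession` unaffected.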



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

