umehrot2 commented on code in PR #6154:
URL: https://github.com/apache/hudi/pull/6154#discussion_r929195103
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -56,6 +56,9 @@ class DefaultSource extends RelationProvider
// Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions",
"true")
}
+ // Revisit EMR Spark and EMRFS incompatibilities, for now disable
+ spark.conf.set("spark.sql.dataPrefetch.enabled", "false")
+
spark.sparkContext.hadoopConfiguration.set("fs.s3.metadata.cache.expiration.seconds",
"0")
Review Comment:
Well, the only reason we did this is to reduce the noise for customers, who
otherwise have to pass additional configurations just to make things work on
EMR. We cannot store these in the EMR Hudi configs because, as of now, the
global Hudi confs we support only work for Hudi-related configurations; we
cannot pass Spark/Hadoop configs through them.
If you have concerns about this, we can revert it and instead document that
customers should explicitly pass these configurations when running the open
source bundle on EMR. It's just not as good an experience.
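For reference, the documented workaround would look roughly like the following sketch, where users pass the two settings at submit time instead of relying on `DefaultSource` to set them (the bundle path and job name below are placeholders, not actual artifacts from this PR):

```shell
# Hypothetical example: explicitly disabling EMRFS data prefetch and S3
# metadata caching when launching a job with the open source Hudi bundle.
# Jar path and application file are placeholders.
spark-submit \
  --jars /path/to/hudi-spark-bundle.jar \
  --conf spark.sql.dataPrefetch.enabled=false \
  --conf spark.hadoop.fs.s3.metadata.cache.expiration.seconds=0 \
  my-hudi-job.jar
```

Asking every EMR customer to remember these two flags is exactly the noise we were trying to avoid by setting them in `DefaultSource`.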
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]