Kontinuation commented on code in PR #2251:
URL: https://github.com/apache/datafusion-comet/pull/2251#discussion_r2311777266
##########
spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala:
##########
@@ -372,3 +388,64 @@ case class CometScanTypeChecker(scanImpl: String) extends DataTypeSupport with C
}
}
}
+
+object CometScanRule extends Logging {
+
+ /**
+ * Validating object store configs can cause requests to be made to S3 APIs (such as when
+ * resolving the region for a bucket). We use a cache to reduce the number of S3 calls.
+ *
+ * The key is the config map converted to a string. The value is the reason that the config is
+ * not valid, or None if the config is valid.
+ */
+ val configValidityMap = new mutable.HashMap[String, Option[String]]()
+
+ /**
+ * We do not expect to see a large number of unique configs within the lifetime of a Spark
+ * session, but we reset the cache once it reaches a fixed size to prevent it growing
+ * indefinitely.
+ */
+ val configValidityMapMaxSize = 1024
+
+ def validateObjectStoreConfig(
+ filePath: String,
+ hadoopConf: Configuration,
+ fallbackReasons: mutable.ListBuffer[String]): Unit = {
+ val objectStoreConfigMap =
+ NativeConfig.extractObjectStoreOptions(hadoopConf, URI.create(filePath))
+
+ val cacheKey = objectStoreConfigMap
+ .map { case (k, v) =>
+ s"$k=$v"
+ }
+ .toList
+ .sorted
+ .mkString("\n")
+
+ if (configValidityMap.size >= configValidityMapMaxSize) {
Review Comment:
AFAIK, Spark reuses the same Hadoop FileSystem object for each URL scheme
and authority (the bucket, in the case of S3). The Hadoop configuration is
applied only once, when the bucket is first accessed; later changes to the
Hadoop filesystem configuration have no effect on the cached instance.
Given that behavior (or implementation detail) of Spark, using only the
bucket name as the cache key should be sufficient, but I think the current,
stricter approach is better.