allisonwang-db commented on code in PR #44681:
URL: https://github.com/apache/spark/pull/44681#discussion_r1449709548
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala:
##########
@@ -34,23 +32,29 @@ import org.apache.spark.util.Utils
 * A manager for user-defined data sources. It is used to register and look up
 * data sources by their short names or fully qualified names.
*/
-class DataSourceManager extends Logging {
+class DataSourceManager(
+ initDataSourceBuilders: => Option[
+ Map[String, UserDefinedPythonDataSource]] = None
+ ) extends Logging {
+ import DataSourceManager._
  // Lazy to avoid being invoked during Session initialization.
  // Otherwise, it goes into an infinite loop: session -> Python runner -> SQLConf -> session.
- private lazy val dataSourceBuilders = {
- val builders = new ConcurrentHashMap[String, UserDefinedPythonDataSource]()
- builders.putAll(DataSourceManager.initialDataSourceBuilders.asJava)
- builders
+  private lazy val staticDataSourceBuilders = initDataSourceBuilders.getOrElse {
+ initialDataSourceBuilders
Review Comment:
Ah, because of these two configs:
```
val simplifiedTraceback: Boolean = SQLConf.get.pysparkSimplifiedTraceback
val workerMemoryMb = SQLConf.get.pythonPlannerExecMemory
```
I think instead of accessing SQLConf here, we should pass these values as
parameters to the `runInPython` method to avoid this initialization issue.
Maybe we can add a TODO for a follow-up PR?
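
For illustration, a minimal sketch of that suggestion, assuming a `runInPython`
that receives both values as parameters (the signature, the `Option[Long]` type
for the memory conf, and the wrapper object are placeholders, not code from
this PR):

```scala
import org.apache.spark.sql.internal.SQLConf

object RunInPythonSketch {
  // Hypothetical signature: both confs arrive as plain parameters instead of
  // being read via SQLConf.get inside the method, so nothing in here can
  // re-enter session initialization. Option[Long] for the memory conf is an
  // assumption.
  def runInPython(
      simplifiedTraceback: Boolean,
      workerMemoryMb: Option[Long]): Unit = {
    // ... launch the Python planner worker with the supplied settings ...
  }

  // Call site, invoked only after the session is fully constructed, where
  // reading SQLConf.get is safe:
  def example(): Unit = {
    runInPython(
      simplifiedTraceback = SQLConf.get.pysparkSimplifiedTraceback,
      workerMemoryMb = SQLConf.get.pythonPlannerExecMemory)
  }
}
```

This keeps the conf reads at call sites that run after initialization, breaking
the session -> Python runner -> SQLConf -> session cycle noted above.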
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]