HyukjinKwon edited a comment on issue #24958: [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
URL: https://github.com/apache/spark/pull/24958#issuecomment-505657735

I think Py4J is only used on the driver side, so we're safe there. `InputFileBlockHolder.getXXX` is called from the related expressions (e.g., `input_file_name`):

```
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    InputFileBlockHolder.getInputFilePath
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    val className = InputFileBlockHolder.getClass.getName.stripSuffix("$")
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    InputFileBlockHolder.getStartOffset
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    val className = InputFileBlockHolder.getClass.getName.stripSuffix("$")
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    InputFileBlockHolder.getLength
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    val className = InputFileBlockHolder.getClass.getName.stripSuffix("$")
```

and `InputFileBlockHolder.set` happens in the iterators, for hadoop, hadoop2, DS1 and DS2:

```
core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala:    InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala:    InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:    InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:    InputFileBlockHolder.set(file.filePath, file.start, file.length)
```

So the values are both set and read on the executor side.
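The hazard being discussed here can be sketched in plain Java: if the holder were backed by an ordinary `ThreadLocal`, a value set by the task's iterator thread would be invisible to any other thread that evaluates the expression. The class name and file path below are illustrative, not Spark's actual code:

```java
// Illustrative sketch (not Spark's code): a plain ThreadLocal breaks
// as soon as set() and get() run on different threads.
public class ThreadLocalPitfall {
    private static final ThreadLocal<String> inputFile = new ThreadLocal<>();

    static String readFromAnotherThread() throws InterruptedException {
        inputFile.set("/data/part-00000.parquet"); // hypothetical path, set on this thread
        final String[] seen = new String[1];
        Thread other = new Thread(() -> seen[0] = inputFile.get());
        other.start();
        other.join();
        return seen[0]; // null: the other thread never observed the value
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readFromAnotherThread()); // prints null
    }
}
```

This is why the question of which thread performs the set and which performs the get matters at all.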
Even if there are some spots I missed: Py4J reuses the same threads for different tasks, but job execution calls happen one at a time due to the GIL, and Py4J only launches another thread if an existing one is busy on the JVM side. So it won't happen that one JVM thread somehow launches multiple jobs at the same time on the same thread. Moreover, I opened a PR to pin threads between the PVM and JVM - https://github.com/apache/spark/pull/24898 - which might be the more correct behaviour (?). If we could switch to that mode, it would permanently get rid of this concern.
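As a rough Java sketch of the approach in the PR title: wrap the value in an `AtomicReference` held by an `InheritableThreadLocal` whose `childValue` hands the parent's reference (not a copy of the value) to the child thread, so updates made by the parent after the child is constructed remain visible. Names and the file path are illustrative, not Spark's actual implementation:

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch (not Spark's code): the child thread inherits the
// parent's AtomicReference itself, so a later set() on the parent is
// still visible to the child.
public class HolderSketch {
    private static final InheritableThreadLocal<AtomicReference<String>> inputFile =
        new InheritableThreadLocal<AtomicReference<String>>() {
            @Override
            protected AtomicReference<String> initialValue() {
                return new AtomicReference<>(null);
            }
            @Override
            protected AtomicReference<String> childValue(AtomicReference<String> parent) {
                return parent; // share the reference, don't copy the value
            }
        };

    static String demo() throws InterruptedException {
        // Touch the thread-local on the parent first so the child inherits it.
        AtomicReference<String> ref = inputFile.get();
        final String[] seen = new String[1];
        Thread child = new Thread(() -> seen[0] = inputFile.get().get());
        // Update AFTER the child was constructed: a value-copying
        // InheritableThreadLocal would miss this; the shared reference does not.
        ref.set("/data/part-00000.parquet"); // hypothetical path
        child.start();
        child.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints /data/part-00000.parquet
    }
}
```

The design choice here is that mutation goes through the shared `AtomicReference`, so thread inheritance only has to propagate the container once.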
