HyukjinKwon commented on issue #24958: [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
URL: https://github.com/apache/spark/pull/24958#issuecomment-505657735
 
 
   I think Py4J is only used on the driver side, so we're safe on that. `InputFileBlockHolder.getXXX` is executed within an expression, and the corresponding `set` happens in the iterator:
   
   ```
   core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala:          InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
   core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala:          InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:          InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:    InputFileBlockHolder.set(file.filePath, file.start, file.length)
   ```
   
   For HadoopRDD, NewHadoopRDD, DataSource V1 (FileScanRDD) and DataSource V2 (FilePartitionReader), it's set at the executor side.
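
   For context, this is the user-facing case the fix targets - a minimal sketch, assuming a live SparkSession and a hypothetical JSON file at `/tmp/people.json`:

   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import input_file_name, udf
   from pyspark.sql.types import StringType

   spark = SparkSession.builder.getOrCreate()

   # A trivial Python UDF. Before this fix, combining a Python UDF with
   # input_file_name() could yield an empty string, because the value set by
   # the task thread was not visible to the separate Python-UDF writer thread.
   basename = udf(lambda path: path.rsplit("/", 1)[-1], StringType())

   df = spark.read.json("/tmp/people.json")  # hypothetical input path
   df.select(basename(input_file_name()).alias("source_file")).show(truncate=False)
   ```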
   
   Even if there are some spots I missed, Py4J reuses the same threads for different tasks, but job execution calls happen one at a time due to the GIL, and Py4J only launches another thread if an existing thread is busy on the JVM side. So it won't happen that one JVM thread somehow launches multiple jobs at the same time.
   
   Moreover, I opened a PR to pin threads between the PVM and the JVM - https://github.com/apache/spark/pull/24898 - which might be the more correct behaviour (?). If we can switch to that mode, it permanently gets rid of this concern.
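
   If that pinned-thread mode lands, switching over would presumably be opt-in; the environment variable below (`PYSPARK_PIN_THREAD`) is my assumption of how it could be exposed, so treat this as a sketch rather than the final API:

   ```python
   import os

   # Assumed opt-in switch for the pinned-thread mode proposed in #24898: each
   # Python thread maps to a dedicated JVM thread, so thread-local state such as
   # InputFileBlockHolder stays consistent across the PVM/JVM boundary.
   # This must be set before the JVM gateway is launched.
   os.environ["PYSPARK_PIN_THREAD"] = "true"  # assumption, not a confirmed setting

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   ```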
