HyukjinKwon edited a comment on issue #24958: [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
URL: https://github.com/apache/spark/pull/24958#issuecomment-505657735

I think Py4J is only used on the driver side, so we're safe there. `InputFileBlockHolder.getXXX` is called from the related expressions (e.g., `input_file_name`):

```
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    InputFileBlockHolder.getInputFilePath
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    val className = InputFileBlockHolder.getClass.getName.stripSuffix("$")
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    InputFileBlockHolder.getStartOffset
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    val className = InputFileBlockHolder.getClass.getName.stripSuffix("$")
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    InputFileBlockHolder.getLength
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala:    val className = InputFileBlockHolder.getClass.getName.stripSuffix("$")
```

and `InputFileBlockHolder.set` happens in the iterators, for hadoop, hadoop2, DS1 and DS2:

```
core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala:    InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala:    InputFileBlockHolder.set(fs.getPath.toString, fs.getStart, fs.getLength)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:    InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala:    InputFileBlockHolder.set(file.filePath, file.start, file.length)
```

So the values are both set and read on the executor side.
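The hazard being discussed here can be sketched in plain Java: if the holder were backed by an ordinary `ThreadLocal`, a value set by the task's iterator thread would be invisible to any other thread that evaluates the expression. The class name and file path below are illustrative, not Spark's actual code:

```java
// Illustrative sketch (not Spark's code): a plain ThreadLocal breaks
// as soon as set() and get() run on different threads.
public class ThreadLocalPitfall {
    private static final ThreadLocal<String> inputFile = new ThreadLocal<>();

    static String readFromAnotherThread() throws InterruptedException {
        inputFile.set("/data/part-00000.parquet"); // hypothetical path, set on this thread
        final String[] seen = new String[1];
        Thread other = new Thread(() -> seen[0] = inputFile.get());
        other.start();
        other.join();
        return seen[0]; // null: the other thread never observed the value
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readFromAnotherThread()); // prints null
    }
}
```

This is why the question of which thread performs the set and which performs the get matters at all.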
Even if there are some spots I missed: Py4J reuses the same threads for different tasks, but job execution calls happen one at a time due to the GIL, and Py4J only launches another thread if an existing one is busy on the JVM side. So it won't happen that one JVM thread somehow launches multiple jobs at the same time on the same thread. Moreover, I opened a PR to pin threads between the PVM and JVM - https://github.com/apache/spark/pull/24898 - which might be the more correct behaviour (?). If we could switch to that mode, it would permanently get rid of this concern.
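As a rough Java sketch of the approach in the PR title: wrap the value in an `AtomicReference` held by an `InheritableThreadLocal` whose `childValue` hands the parent's reference (not a copy of the value) to the child thread, so updates made by the parent after the child is constructed remain visible. Names and the file path are illustrative, not Spark's actual implementation:

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch (not Spark's code): the child thread inherits the
// parent's AtomicReference itself, so a later set() on the parent is
// still visible to the child.
public class HolderSketch {
    private static final InheritableThreadLocal<AtomicReference<String>> inputFile =
        new InheritableThreadLocal<AtomicReference<String>>() {
            @Override
            protected AtomicReference<String> initialValue() {
                return new AtomicReference<>(null);
            }
            @Override
            protected AtomicReference<String> childValue(AtomicReference<String> parent) {
                return parent; // share the reference, don't copy the value
            }
        };

    static String demo() throws InterruptedException {
        // Touch the thread-local on the parent first so the child inherits it.
        AtomicReference<String> ref = inputFile.get();
        final String[] seen = new String[1];
        Thread child = new Thread(() -> seen[0] = inputFile.get().get());
        // Update AFTER the child was constructed: a value-copying
        // InheritableThreadLocal would miss this; the shared reference does not.
        ref.set("/data/part-00000.parquet"); // hypothetical path
        child.start();
        child.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints /data/part-00000.parquet
    }
}
```

The design choice here is that mutation goes through the shared `AtomicReference`, so thread inheritance only has to propagate the container once.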
