brkyvz commented on issue #24958: [SPARK-28153][PYTHON] Use AtomicReference at 
InputFileBlockHolder (to support input_file_name with Python UDF)
URL: https://github.com/apache/spark/pull/24958#issuecomment-505517283
 
 
   While this fix seems fine, I'm worried it could actually lead to correctness 
issues: Py4J may reuse the same threads for different tasks, so the thread-local 
AtomicReference could return the wrong value for a given task.
   
   A more robust solution would be:
     1. Store InputFileBlocks in a ConcurrentHashMap keyed by taskAttemptId 
(unique in a Spark cluster) -> InputFileBlock.
     2. In Scala, we can simply look up these values by the current task's 
attempt id.
     3. In Python and R, it seems we already receive the TaskContext as part of 
the protocol (in worker.py), so we can look up the value by 
TaskContext.taskAttemptId.
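
   To illustrate, here is a minimal Python sketch of the keyed-map idea. The 
class and method names loosely mirror Spark's `InputFileBlockHolder`, but this 
is a simplified model under my assumptions, not Spark's actual implementation 
(which would use a JVM-side `ConcurrentHashMap`; a plain dict plus a lock 
stands in for it here):

   ```python
   import threading

   class InputFileBlock:
       """Hypothetical stand-in for the per-task input file block."""
       def __init__(self, file_path, start, length):
           self.file_path = file_path
           self.start = start
           self.length = length

   class InputFileBlockHolder:
       # taskAttemptId -> InputFileBlock; the lock models the thread safety
       # a ConcurrentHashMap would provide on the JVM side.
       _blocks = {}
       _lock = threading.Lock()

       @classmethod
       def set(cls, task_attempt_id, block):
           with cls._lock:
               cls._blocks[task_attempt_id] = block

       @classmethod
       def get(cls, task_attempt_id):
           with cls._lock:
               return cls._blocks.get(task_attempt_id)

       @classmethod
       def unset(cls, task_attempt_id):
           # Called on task completion so entries do not leak.
           with cls._lock:
               cls._blocks.pop(task_attempt_id, None)

   # Even if the same worker thread serves two different tasks, each lookup
   # is keyed by the task's unique attempt id, so a stale value from a
   # previous task on that thread cannot be returned.
   InputFileBlockHolder.set(1, InputFileBlock("hdfs://a.parquet", 0, 128))
   InputFileBlockHolder.set(2, InputFileBlock("hdfs://b.parquet", 0, 64))
   assert InputFileBlockHolder.get(1).file_path == "hdfs://a.parquet"
   assert InputFileBlockHolder.get(2).file_path == "hdfs://b.parquet"
   ```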
   
   What do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
