brkyvz commented on issue #24958: [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF)
URL: https://github.com/apache/spark/pull/24958#issuecomment-505517283

While this fix seems fine, I'm worried it could actually lead to correctness issues: Py4J may reuse the same threads for different tasks, so we could return the wrong reference. A more robust solution would be:

1. Store InputFileBlocks in a ConcurrentHashMap, keyed by taskAttemptId (unique in a Spark cluster) -> InputFileBlock.
2. In Scala, we can simply look up these values.
3. In Python and R, we already receive the TaskContext as part of the protocol (in worker.py), so we can do the lookup by TaskContext.taskAttemptId.

What do you think?
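The proposed registry could be sketched roughly as below. This is a hedged illustration in Java (Spark's actual holder is Scala); the class and field names (`InputFileBlockRegistry`, `InputFileBlock`, `filePath`, `startOffset`, `length`) are hypothetical stand-ins, not Spark's real API — only the idea of a `ConcurrentHashMap` keyed by taskAttemptId comes from the comment.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed taskAttemptId -> InputFileBlock registry.
// Names are illustrative; they do not match Spark's actual InputFileBlockHolder.
class InputFileBlockRegistry {
    static final class InputFileBlock {
        final String filePath;
        final long startOffset;
        final long length;

        InputFileBlock(String filePath, long startOffset, long length) {
            this.filePath = filePath;
            this.startOffset = startOffset;
            this.length = length;
        }
    }

    // taskAttemptId is unique per task attempt in a Spark cluster, so entries
    // cannot be confused even if Py4J reuses a thread for a different task.
    private static final ConcurrentHashMap<Long, InputFileBlock> blocks =
        new ConcurrentHashMap<>();

    static void set(long taskAttemptId, InputFileBlock block) {
        blocks.put(taskAttemptId, block);
    }

    static InputFileBlock get(long taskAttemptId) {
        return blocks.get(taskAttemptId); // null if no block was recorded
    }

    static void unset(long taskAttemptId) {
        blocks.remove(taskAttemptId); // clean up on task completion to avoid leaks
    }
}
```

On the Python/R side, the worker would use the taskAttemptId it already receives via the TaskContext protocol as the lookup key, rather than relying on thread identity.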