Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17617
  
    @holdenk , the basic problem is that Spark uses the Hadoop FileSystem
statistics API to get bytesRead and bytesWritten per task. That statistics API
is implemented with thread-local variables, which is fine for Scala/Java RDD
computations, since the computation runs in the same thread as the task. But
for PythonRDD, Spark creates another thread to consume the data, so counting
bytesRead the current way yields a wrong number.
    
    This is a generic problem whenever the task thread and the RDD computation
thread are not the same: because the counters are thread-local, the calculated
bytesRead metric will be wrong.

