HyukjinKwon opened a new pull request #23690: [MINOR][PYTHON] Minor reduce Py4J 
communication cost in PySpark's execution barrier check
URL: https://github.com/apache/spark/pull/23690
 
 
   ## What changes were proposed in this pull request?
   
   While investigating flaky tests, I came across the following traceback:
   
   ```
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/rdd.py", line 2512, in __init__
           self.is_barrier = prev._is_barrier() or isFromBarrier
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/rdd.py", line 2412, in _is_barrier
           return self._jrdd.rdd().isBarrier()
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
           answer, self.gateway_client, self.target_id, self.name)
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 342, in get_return_value
           return OUTPUT_CONVERTER[type](answer[2:], gateway_client)
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 2492, in <lambda>
           lambda target_id, gateway_client: JavaObject(target_id, gateway_client))
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1324, in __init__
           ThreadSafeFinalizer.add_finalizer(key, value)
         File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py", line 43, in add_finalizer
           cls.finalizers[id] = weak_ref
         File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in __exit__
           self.release()
         File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in release
           self.__block.release()
       error: release unlocked lock
   ```
   
   I assume the failure is not directly related to the test itself, but I noticed 
that `prev._is_barrier()` makes a call through Py4J.
   
   Py4J calls are expensive and, IMHO, they contribute to the flakiness. Therefore, this 
PR proposes to skip the Py4J call when `isFromBarrier` is `True`.
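
One way to skip the call is to rely on the short-circuit behaviour of `or`: if `isFromBarrier` is evaluated first and is `True`, `prev._is_barrier()` is never invoked. The sketch below is only an illustration of that idea and not necessarily the exact diff in this PR. `FakeRDD`, `gateway_calls`, `barrier_flag_before` and `barrier_flag_after` are hypothetical names, with a counter standing in for the Py4J round trip.

```python
class FakeRDD:
    """Hypothetical stand-in for an RDD; this is not PySpark's real class."""

    def __init__(self, barrier=False):
        self._barrier = barrier
        self.gateway_calls = 0  # counts simulated Py4J round trips

    def _is_barrier(self):
        # In PySpark this is `self._jrdd.rdd().isBarrier()`, i.e. a Py4J call.
        self.gateway_calls += 1
        return self._barrier


def barrier_flag_before(prev, isFromBarrier=False):
    # Order shown in the traceback: prev._is_barrier() is always evaluated.
    return prev._is_barrier() or isFromBarrier


def barrier_flag_after(prev, isFromBarrier=False):
    # Operands swapped: `or` short-circuits, so prev._is_barrier() is
    # skipped whenever isFromBarrier is already True.
    return isFromBarrier or prev._is_barrier()
```

With `isFromBarrier=True`, `barrier_flag_before` still increments `gateway_calls`, while `barrier_flag_after` leaves it at zero. Both return the same boolean for every input, so the change only affects how often the gateway is contacted.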
   
   ## How was this patch tested?
   
   Existing unit tests should cover this.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
