zhengruifeng commented on PR #40607:
URL: https://github.com/apache/spark/pull/40607#issuecomment-1497315263
I am hitting a weird failure of
`TorchDistributorDistributedUnitTestsOnConnect.test_parity_torch_distributor`.
It appeared after I rebased this PR yesterday, but I can't find any suspicious
commits merged recently.
```
======================================================================
ERROR [18.362s]: test_end_to_end_run_distributedly (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorDistributedUnitTestsOnConnect)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 457, in test_end_to_end_run_distributedly
    output = TorchDistributor(num_processes=2, local_mode=False, use_gpu=False).run(
  File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 749, in run
    output = self._run_distributed_training(framework_wrapper_fn, train_object, *args)
  File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 607, in _run_distributed_training
    self.spark.range(start=0, end=self.num_tasks, step=1, numPartitions=self.num_tasks)
  File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 1354, in collect
    table, schema = self._session.client.to_table(query)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 668, in to_table
    table, schema, _, _, _ = self._execute_and_fetch(req)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 982, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 963, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 1055, in _handle_error
    self._handle_rpc_error(error)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 1095, in _handle_rpc_error
    raise SparkConnectGrpcException(str(rpc_error)) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Java heap space"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:35071 {created_time:"2023-04-05T10:13:52.254507275+00:00", grpc_status:2, grpc_message:"Java heap space"}"
>
```
In my local env, I can only reproduce this by decreasing the driver memory
(e.g. `spark.driver.memory=512M`), and the issue can be resolved simply by
increasing the driver memory to 1024M.
I tested different combinations locally, such as:
`spark.driver.memory=1024M, spark.executor.memory=512M`
`spark.driver.memory=1024M, spark.executor.memory=1024M`
etc.,
and they all work as expected.
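For reference, this is roughly how I vary the memory settings locally (a sketch, assuming a local Spark checkout; the exact test invocation may differ from how the CI job launches it):

```shell
# Hypothetical local repro sketch: run the distributor test with a reduced
# driver heap to trigger the "Java heap space" OOM, then raise it to confirm
# the fix. Conf values mirror the combinations tested above.
./bin/spark-submit \
  --conf spark.driver.memory=512M \
  --conf spark.executor.memory=512M \
  python/pyspark/ml/torch/tests/test_distributor.py

# Passes again once the driver heap is bumped:
./bin/spark-submit \
  --conf spark.driver.memory=1024M \
  --conf spark.executor.memory=512M \
  python/pyspark/ml/torch/tests/test_distributor.py
```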
But in GitHub Actions (the resource limitation seems to be
https://github.com/apache/spark/blob/0b45a5278026c2ea9ce2b127333514f7a7a933f4/.github/workflows/build_and_test.yml#L1028),
no matter how large a driver memory I set (3G, 4G), this test just keeps
failing with this error message.
Do you have any thoughts on this? @WeichenXu123 @HyukjinKwon
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]