zhengruifeng commented on PR #40607:
URL: https://github.com/apache/spark/pull/40607#issuecomment-1497315263
I am hitting a weird failure of
`TorchDistributorDistributedUnitTestsOnConnect.test_parity_torch_distributor`.
It appeared after I rebased this PR yesterday, but I can't find any suspicious
commits merged recently.
```
======================================================================
ERROR [18.362s]: test_end_to_end_run_distributedly (pyspark.ml.tests.connect.test_parity_torch_distributor.TorchDistributorDistributedUnitTestsOnConnect)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/ml/torch/tests/test_distributor.py", line 457, in test_end_to_end_run_distributedly
    output = TorchDistributor(num_processes=2, local_mode=False, use_gpu=False).run(
  File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 749, in run
    output = self._run_distributed_training(framework_wrapper_fn, train_object, *args)
  File "/__w/spark/spark/python/pyspark/ml/torch/distributor.py", line 607, in _run_distributed_training
    self.spark.range(start=0, end=self.num_tasks, step=1, numPartitions=self.num_tasks)
  File "/__w/spark/spark/python/pyspark/sql/connect/dataframe.py", line 1354, in collect
    table, schema = self._session.client.to_table(query)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 668, in to_table
    table, schema, _, _, _ = self._execute_and_fetch(req)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 982, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 963, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 1055, in _handle_error
    self._handle_rpc_error(error)
  File "/__w/spark/spark/python/pyspark/sql/connect/client.py", line 1095, in _handle_rpc_error
    raise SparkConnectGrpcException(str(rpc_error)) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Java heap space"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:35071 {created_time:"2023-04-05T10:13:52.254507275+00:00", grpc_status:2, grpc_message:"Java heap space"}"
>
```
In my local env, I can only reproduce this by decreasing the driver memory
(e.g. `spark.driver.memory=512M`), and the issue can be resolved simply by
increasing the driver memory to 1024M.
I tested different combinations locally, such as:
`spark.driver.memory=1024M, spark.executor.memory=512M`
`spark.driver.memory=1024M, spark.executor.memory=1024M`
etc.,
and they all work as expected.
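For reference, this is roughly how I vary the memory settings locally (a sketch, assuming a local Spark checkout; the exact test invocation may differ from how the CI job launches it):

```shell
# Hypothetical local repro sketch: run the distributor test with a reduced
# driver heap to trigger the "Java heap space" OOM, then raise it to confirm
# the fix. Conf values mirror the combinations tested above.
./bin/spark-submit \
  --conf spark.driver.memory=512M \
  --conf spark.executor.memory=512M \
  python/pyspark/ml/torch/tests/test_distributor.py

# Passes again once the driver heap is bumped:
./bin/spark-submit \
  --conf spark.driver.memory=1024M \
  --conf spark.executor.memory=512M \
  python/pyspark/ml/torch/tests/test_distributor.py
```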
But in GitHub Actions (the resource limitation seems to be
https://github.com/apache/spark/blob/0b45a5278026c2ea9ce2b127333514f7a7a933f4/.github/workflows/build_and_test.yml#L1028),
no matter how large a driver memory I set (3G, 4G), this test just keeps
failing with this error message.
Do you have any thoughts on this? @WeichenXu123 @HyukjinKwon
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]