juliuszsompolski commented on PR #48037:
URL: https://github.com/apache/spark/pull/48037#issuecomment-2360953704
I merged in the latest master. Thanks @HyukjinKwon!
As for my environment: I have a venv with python3.11, because that's what
the CI seems to be using. I installed all dependencies with `pip install -r
dev/requirements.txt`, and I am running the test with
```
python/run-tests --parallelism 1 --python-executables python3.11 --testnames pyspark.ml.torch.tests.test_data_loader
```
after doing `build/sbt package`.
For me the test fails after a moment with
```
  File "/opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_SparkPartitionTorchDataset._get_field_converter.<locals>.converter'
...
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/Users/julek/dev/apache-spark/python/target/5fb69f79-c617-4c31-ae50-50ccd9b3f722/tmp0aee6vws/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-19_15:09:01
host : QWKM9379T2
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 53022)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR (44.113s)
...
py4j.protocol.Py4JJavaError: An error occurred while calling o75.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
...
RuntimeError: TorchDistributor failed during training.View stdout logs for detailed error message.
```
which doesn't seem to be related to this PR. On this PR's CI, however, the
test doesn't fail like that; it hangs instead.
Is there something that I'm missing in my environment?
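
For reference, here is a minimal sketch of the kind of failure the `AttributeError` above describes. It assumes the child process is started with the `spawn` start method (the default on macOS since Python 3.8), under which the objects handed to the child are pickled, and nested functions cannot be pickled. Whether that explains the difference from CI is an assumption, not something verified here. `make_converter` and `make_picklable_converter` are hypothetical stand-ins, not the actual `_get_field_converter` code from PySpark.

```python
import functools
import operator
import pickle


def make_converter(scale):
    # Hypothetical stand-in: returns a nested function. Its qualified name
    # contains "<locals>", so pickle cannot find it by name.
    def converter(value):
        return value * scale

    return converter


def make_picklable_converter(scale):
    # One possible alternative (an assumption, not the PySpark fix):
    # functools.partial over a top-level callable pickles by reference.
    return functools.partial(operator.mul, scale)


def can_pickle(obj):
    # Try to serialize obj the way multiprocessing would have to.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


closure_picklable = can_pickle(make_converter(2))
partial_picklable = can_pickle(make_picklable_converter(2))
print(closure_picklable, partial_picklable)
```

With Python 3.11 on Linux the default start method is `fork`, which does not pickle this state, so the same code may behave differently there.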