juliuszsompolski commented on PR #48037:
URL: https://github.com/apache/spark/pull/48037#issuecomment-2360953704

   I merged in latest master. Thanks @HyukjinKwon !
   
   As for my environment, I have a venv with python3.11, because that's what 
the CI seems to be using, I installed all dependencies with `pip install -r 
dev/requirements.txt`, and I am running the test with 
   ```
   python/run-tests --parallelism 1 --python-executables python3.11 --testnames 
pyspark.ml.torch.tests.test_data_loader
   ```
   after doing `build/sbt package`.
   For me the test fails after a moment with
   ```
     File 
"/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/reduction.py",
 line 60, in dump
       ForkingPickler(file, protocol).dump(obj)
   AttributeError: Can't pickle local object 
'_SparkPartitionTorchDataset._get_field_converter.<locals>.converter'
   ...
   torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
   ============================================================
   
/Users/julek/dev/apache-spark/python/target/5fb69f79-c617-4c31-ae50-50ccd9b3f722/tmp0aee6vws/train.py
 FAILED
   ------------------------------------------------------------
   Failures:
     <NO_OTHER_FAILURES>
   ------------------------------------------------------------
   Root Cause (first observed failure):
   [0]:
     time      : 2024-09-19_15:09:01
     host      : QWKM9379T2
     rank      : 0 (local_rank: 0)
     exitcode  : 1 (pid: 53022)
     error_file: <N/A>
     traceback : To enable traceback see: 
https://pytorch.org/docs/stable/elastic/errors.html
   ============================================================
   ERROR (44.113s)
   ...
   py4j.protocol.Py4JJavaError: An error occurred while calling 
o75.collectToPython.
   : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
not recover from a failed barrier ResultStage. Most recent failure reason: 
Stage failed because barrier task ResultTask(0, 0) finished unsuccessfully.
   ...
   RuntimeError: TorchDistributor failed during training.View stdout logs for 
detailed error message.
   ```
   which doesn't seem to be related to this PR. But on this PR CI it doesn't 
fail like that and instead the test hangs...
   
   Is there something that I'm missing in my environment?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to