pavel0fadeev commented on PR #48037: URL: https://github.com/apache/spark/pull/48037#issuecomment-2362343583
@juliuszsompolski Regarding the issue with your environment on the **master** branch: I checked, and it looks like I hit the same problem when running the `pyspark.ml.torch.tests.test_data_loader` tests on my Mac with Python 3.11. The key error appears to be `AttributeError: Can't pickle local object '_SparkPartitionTorchDataset._get_field_converter.<locals>.converter'`, and this article describes the same symptom: https://medium.com/devopss-hole/python-multiprocessing-pickle-issue-e2d35ccf96a9.

In short, the **multiprocessing** library cannot pickle locally defined functions when the "spawn" multiprocessing context is used, and "spawn" is the default on Mac and Windows. Unix systems other than Mac default to "fork", which may explain why this test behaves differently on some local environments than on the GitHub runners. The **torch** `DataLoader` uses **multiprocessing** under the hood, so I tried switching the `multiprocessing_context` to "fork"; that got me past the initial exception, but I then hit another one.

Next I tried setting `num_workers=0` for the `DataLoader` so the test avoids multiprocessing altogether: https://github.com/apache/spark/blob/04455797bfb3631b13b41cfa5d2604db3bf8acc2/python/pyspark/ml/torch/tests/test_data_loader.py#L70 I changed that line to `data_loader = _get_spark_partition_data_loader(num_samples, batch_size, num_workers=0)`, and the test then completed successfully. However, I don't know how to change the environment to get rid of this issue without changing the code.
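
For reference, here is a minimal standalone sketch (not the actual PySpark test code; `make_converter` and `ToyDataset` are just illustrative stand-ins) of why the "spawn" context trips over a nested converter function, and the two workarounds I tried:

```python
import multiprocessing
import pickle

import torch
from torch.utils.data import DataLoader, Dataset


def make_converter():
    # Nested function, analogous to _get_field_converter.<locals>.converter:
    # it has no importable qualified name, so the "spawn" start method,
    # which must pickle it to send it to a worker process, fails with
    # "Can't pickle local object ...".
    def converter(value):
        return float(value)
    return converter


class ToyDataset(Dataset):
    """Illustrative stand-in for _SparkPartitionTorchDataset."""

    def __init__(self):
        self.convert = make_converter()  # unpicklable attribute
        self.data = list(range(100))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.tensor(self.convert(self.data[idx]))


dataset = ToyDataset()

# Demonstrate the underlying pickling failure directly.
try:
    pickle.dumps(dataset)
except (AttributeError, pickle.PicklingError) as e:
    print(f"pickle failed: {e}")

# Workaround 1: num_workers=0 keeps data loading in the main process,
# so nothing needs to be pickled at all.
loader = DataLoader(dataset, batch_size=10, num_workers=0)

# Workaround 2 (Unix only): the "fork" start method lets worker processes
# inherit the dataset instead of unpickling it.
loader = DataLoader(
    dataset,
    batch_size=10,
    num_workers=2,
    multiprocessing_context=multiprocessing.get_context("fork"),
)
```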
