AlenkaF commented on issue #38675: URL: https://github.com/apache/arrow/issues/38675#issuecomment-1808026314
Hi @genesis-jamin, thank you for opening an issue!

Looking at the Ray docs, I see a note on https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html saying

> If pyarrow can’t represent your data, this method errors.

That is what is happening in your case, as there is currently no option to consume PyTorch tensors in PyArrow. If I understand correctly, Ray goes through pandas (and pandas uses pyarrow) to write the dataset to a Parquet file, and because torch tensors are not recognised by pyarrow you get an error.

One option would be to convert the torch tensor to a NumPy ndarray (only a view, if I understand correctly: https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html)

```python
In [1]: import ray
   ...: import torch
   ...: ds_test = ray.data.from_items([{"nested": {"tensor_a": torch.zeros(5).numpy(), "tensor_b": torch.zeros(5).numpy()}}])
2023-11-13 12:48:11,823 INFO worker.py:1673 -- Started a local Ray instance.

In [2]: ds_test.write_parquet("test_parquet")
2023-11-13 12:48:15,472 INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
...

In [3]: import pyarrow as pa

In [4]: pa.parquet.read_table("test_parquet").to_pandas()
Out[4]:
                                               nested
0  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
1  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
2  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
```

It would be nice to be able to convert this kind of data to NumPy arrays (or even to a fixed shape tensor in the nested case) on our side. What do you think, @jorisvandenbossche?
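
If the rows contain tensors at several nesting levels, calling `.numpy()` by hand on each one gets tedious. Here is a minimal sketch of a helper that does the conversion recursively before the data reaches Ray. The name `tensors_to_numpy` is hypothetical (it is not part of Ray, PyArrow, or PyTorch), and the duck-typed check on `detach`/`cpu`/`numpy` is an assumption about what counts as a tensor here, not something PyArrow does itself:

```python
def tensors_to_numpy(obj):
    """Recursively replace tensor-like leaves with NumPy arrays.

    Hypothetical helper: an object is treated as a tensor when it exposes
    ``detach``, ``cpu`` and ``numpy`` (as ``torch.Tensor`` does). Everything
    else is returned unchanged.
    """
    # Duck-typed check, so this module does not need to import torch.
    if all(hasattr(obj, name) for name in ("detach", "cpu", "numpy")):
        # detach() drops autograd tracking and cpu() moves the data to host
        # memory; numpy() then returns the array.
        return obj.detach().cpu().numpy()

    # Walk into dicts, keeping the keys as they are.
    if isinstance(obj, dict):
        return {key: tensors_to_numpy(value) for key, value in obj.items()}

    # Walk into lists and tuples, preserving which of the two it was.
    if isinstance(obj, (list, tuple)):
        converted = [tensors_to_numpy(value) for value in obj]
        return tuple(converted) if isinstance(obj, tuple) else converted

    # Scalars, strings, existing ndarrays, etc. pass through untouched.
    return obj
```

With torch installed, the row from the session above could then be built as `ray.data.from_items([tensors_to_numpy({"nested": {"tensor_a": torch.zeros(5), "tensor_b": torch.zeros(5)}})])`. For a tensor that already lives on the CPU and does not require grad, I would expect the result to still share memory with the tensor, as with a plain `.numpy()` call.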
