AlenkaF commented on issue #38675:
URL: https://github.com/apache/arrow/issues/38675#issuecomment-1808026314

   Hi @genesis-jamin, thank you for opening an issue!
   
   Looking at the Ray docs I see a note on 
https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html 
saying
   > If pyarrow can’t represent your data, this method errors.
   
   This is what happens in your case: PyArrow currently has no way to consume
   PyTorch tensors. If I understand correctly, Ray goes through pandas (which in
   turn uses PyArrow) to write the dataset to a Parquet file, and because PyArrow
   does not recognise torch tensors you get an error.
   
   One option would be to pass each torch tensor as a NumPy ndarray via
   `Tensor.numpy()` (which, if I understand correctly, returns only a view that
   shares memory with the tensor, see
   https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html):
   
   ```python
   In [1]: import ray
      ...: import torch
      ...: ds_test = ray.data.from_items([{"nested": {
      ...:     "tensor_a": torch.zeros(5).numpy(),
      ...:     "tensor_b": torch.zeros(5).numpy()}}])
   2023-11-13 12:48:11,823 INFO worker.py:1673 -- Started a local Ray instance.
   
   In [2]: ds_test.write_parquet("test_parquet")
   2023-11-13 12:48:15,472 INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
   ...
   
   In [3]: import pyarrow.parquet as pq
   
   In [4]: pq.read_table("test_parquet").to_pandas()
   Out[4]: 
                                                 nested
   0  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
   1  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
   2  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
   ```
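
If the rows contain tensors at several nesting levels, the `.numpy()` calls can be applied in one pass before handing the data to Ray. The following is only a sketch: `to_numpy_nested` is a hypothetical helper (not part of Ray, PyTorch or PyArrow), and it assumes that anything with a callable `numpy` method should be converted, which is true for `torch.Tensor`.

```python
from collections.abc import Mapping


def to_numpy_nested(value):
    """Recursively replace tensor-like leaves with the result of ``.numpy()``.

    Assumption of this sketch: a leaf is "tensor-like" if it exposes a
    callable ``numpy`` method (as ``torch.Tensor`` does).
    """
    if isinstance(value, Mapping):
        # Rebuild dictionaries with converted values, keeping the keys.
        return {key: to_numpy_nested(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_numpy_nested(item) for item in value]
    if isinstance(value, tuple):
        return tuple(to_numpy_nested(item) for item in value)
    to_numpy = getattr(value, "numpy", None)
    if callable(to_numpy):
        return to_numpy()
    # Anything else (ints, strings, ndarrays, ...) is passed through unchanged.
    return value
```

With this in place the example above could be written as `ray.data.from_items([to_numpy_nested(row) for row in rows])`, where `rows` is your list of dictionaries. Note that `Tensor.numpy()` itself raises for tensors that require grad or live on a GPU, so those would need `.detach().cpu()` first.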
   
   It would be nice to be able to convert this kind of data to NumPy arrays
   (or even to a fixed shape tensor in the nested case) on our side. What do you
   think, @jorisvandenbossche?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
