guillermojp commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1517286346

   Just for completeness, I've also faced this issue when trying to write PyArrow tables directly to the file itself. So, if `tables` is a list of `pa.Table`, each with rows of this (example) format:
   
   ```python
   {'var_0': 0,
    'var_1': 33,
    'var_2': 0,
    'var_3': [20256, 3798],
    'image': array([[[  0,  58], # Array3D, in this case of shape (224, 224, 3)
           [  0,  57],
           [  0,  52],
           ...,
           [  0,  22],
           [  0, 245],
           [  0, 156]],
           ...,
           [  0,   0],
           [  0,   0],
           [  0,   0]]], dtype=uint8)}
   ```
   
   and with the following schema:
   
   ```python
   var_0: int32
   var_1: int32
   var_2: int32
   var_3: fixed_size_list<item: int32>[2]
     child 0, item: int32
   image: extension<arrow.py_extension_type<Array3DExtensionType>>
   ```
   
   I have tried to use the `ArrowWriter` as follows:
   
   ```python
   from datasets import Array3D, Dataset
   from datasets.arrow_writer import ArrowWriter
   from datasets.features.features import Array3DExtensionType
   
   # schema, out_path and tables are defined as above
   with ArrowWriter(schema=schema, path=out_path) as writer:
       for table in tables:
           writer.write_row(table)
       writer.finalize()
   ```
   
   And it throws exactly the same error. Strangely, if I use pydicts as inputs and `writer.write` instead of `writer.write_row`, the error goes away (ignoring, of course, the inefficiency of converting a `pa.Table` to a pydict, etc.; this is not a comparison in terms of computational time):
   
   ```python
   from datasets import Array3D, Dataset
   from datasets.arrow_writer import ArrowWriter
   from datasets.features.features import Array3DExtensionType
   
   # schema, out_path and tables are defined as above
   with ArrowWriter(schema=schema, path=out_path) as writer:
       for table in tables:
           table_dict = table.to_pydict()
           writer.write(table_dict)
       writer.finalize()
   ```
   
   Beware: by default, the `writer_batch_size` argument of the `ArrowWriter` class (or of the `.write`/`.write_row` methods) is what triggers this issue, since `writer_batch_size` defaults to 1000, if I remember correctly. Setting `writer_batch_size = 1` would """solve""" the issue, but of course the resulting Arrow file would be ungodly RAM-heavy to load into memory, etc. (see the sketch below).
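   
   A hedged sketch of what passing the batch size explicitly could look like, assuming `schema`, `out_path` and `tables` are defined as in the snippets above (not a recommendation, for the reason just given):
   
   ```python
   from datasets.arrow_writer import ArrowWriter
   
   # Assumption: schema, out_path and tables are defined as in the snippets above.
   # With writer_batch_size=1, every write is flushed as its own record batch,
   # which sidesteps the error at the cost of a file full of tiny batches.
   with ArrowWriter(schema=schema, path=out_path, writer_batch_size=1) as writer:
       for table in tables:
           writer.write(table.to_pydict())
       writer.finalize()
   ```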

