guillermojp commented on issue #34455:
URL: https://github.com/apache/arrow/issues/34455#issuecomment-1517286346
Just for completeness, I've also hit this issue when trying to write PyArrow
tables directly to the file itself. So, suppose `tables` is a list of
`pa.Table` objects whose rows look like this (example):
```python
{'var_0': 0,
'var_1': 33,
'var_2': 0,
'var_3': [20256, 3798],
'image': array([[[ 0, 58], # Array3D, in this case of shape (224, 224, 3)
[ 0, 57],
[ 0, 52],
...,
[ 0, 22],
[ 0, 245],
[ 0, 156]],
...,
[ 0, 0],
[ 0, 0],
[ 0, 0]]], dtype=uint8)}
```
and with the following schema:
```python
var_0: int32
var_1: int32
var_2: int32
var_3: fixed_size_list<item: int32>[2]
child 0, item: int32
image: extension<arrow.py_extension_type<Array3DExtensionType>>
```
I have tried to use the `ArrowWriter` as follows:
```python
from datasets import Array3D, Dataset
from datasets.arrow_writer import ArrowWriter
from datasets.features.features import Array3DExtensionType
with ArrowWriter(schema=schema, path=out_path) as writer:
    for table in tables:
        writer.write_row(table)
    writer.finalize()
```
And it throws exactly the same error. Strangely, if I pass "pydicts" as
inputs and use `writer.write` instead of `writer.write_row`, the error goes away
(setting aside, of course, the inefficiency of converting a `pa.Table` to a
pydict; I'm not comparing computational time here):
```python
from datasets import Array3D, Dataset
from datasets.arrow_writer import ArrowWriter
from datasets.features.features import Array3DExtensionType
with ArrowWriter(schema=schema, path=out_path) as writer:
    for table in tables:
        table_dict = table.to_pydict()
        writer.write(table_dict)
    writer.finalize()
```
Beware: the `writer_batch_size` argument of the `ArrowWriter` class (or of the
`.write`/`.write_row` methods) would be the one causing this issue at its
default value, as `writer_batch_size` defaults to... 1000, was it? Setting
`writer_batch_size = 1` would "solve" the issue, but of course the resulting
Arrow file would be ungodly RAM-heavy to load into memory, etc.
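
For a sense of scale, here is a back-of-the-envelope sketch of how much raw
image data a single write batch holds at different `writer_batch_size` values.
It assumes uint8 images of shape (224, 224, 3), as in the example above;
`batch_bytes` is a hypothetical helper, not part of `datasets` or PyArrow, and
the figures ignore the integer columns and any Arrow overhead. Whether this size
is what actually triggers the error is not established here.

```python
from math import prod

# Assumption: uint8 images of shape (224, 224, 3), taken from the example row.
IMAGE_SHAPE = (224, 224, 3)
BYTES_PER_VALUE = 1  # uint8 is one byte per value


def batch_bytes(writer_batch_size: int) -> int:
    """Raw image payload, in bytes, for one batch of `writer_batch_size` rows.

    Hypothetical helper for illustration only; it ignores the other columns
    and any Arrow metadata or offset buffers.
    """
    return writer_batch_size * prod(IMAGE_SHAPE) * BYTES_PER_VALUE


if __name__ == "__main__":
    for size in (1, 100, 1000):
        print(f"writer_batch_size={size:>4}: {batch_bytes(size) / 1e6:.1f} MB")
```

Under these assumptions a batch of 1000 rows carries roughly 150 MB of image
data, versus about 0.15 MB for a batch of one row.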