Hello,
We are looking for a way to create a single-chunk table, because of the
issue described [here](https://issues.apache.org/jira/browse/ARROW-11989):
a single-chunk table is much faster to index.
Currently, we write the table by first loading all the files, converting
them to tables, and then combining the chunks:
```python
import pyarrow as pa
import pyarrow.dataset

train_datasets = []
for ds_file in all_datasets:
    ds = pa.dataset.dataset(ds_file, format='feather')
    train_datasets.append(ds.to_table())

# Merge everything into a single chunk; peak memory is ~2x the dataset here.
combined_table = pa.concat_tables(train_datasets).combine_chunks()
batches = combined_table.to_batches()  # a single batch after combine_chunks()

with open(args.output + "{}.arrow".format(split), "wb") as f:
    s = pa.ipc.new_stream(
        f, combined_table.schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    s.write_batch(batches[0])
    s.close()
```
However, this approach peaks at roughly 2x the original dataset size in
memory. Is there a way to write the data file by file
while still ending up with a single chunk?
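To make the question concrete, here is a rough sketch of what we mean by
writing file by file (reusing the same `all_datasets`, `args`, and `split`
variables from above). It keeps only one input table in memory at a time,
but the output then contains one record batch per file rather than a
single chunk:

```python
import pyarrow as pa
import pyarrow.dataset

# Incremental sketch: stream each file's table as soon as it is read, so
# peak memory stays at one input table. The catch: the resulting stream
# holds one record batch per file, not a single chunk.
with open(args.output + "{}.arrow".format(split), "wb") as f:
    writer = None
    for ds_file in all_datasets:
        table = pa.dataset.dataset(ds_file, format='feather').to_table()
        if writer is None:
            writer = pa.ipc.new_stream(
                f, table.schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
            )
        writer.write_table(table)  # writes the table's batches as-is
    writer.close()
```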
Thank you!
--
Best,
Kaixiang