Hello,

We are looking for an approach to create a single-chunk table, because of the
issue [here](https://issues.apache.org/jira/browse/ARROW-11989). A single-chunk
table would be much faster during indexing.

Currently, we build the table by loading all the files, converting them to
tables, and then combining the chunks:
```python
import pyarrow as pa
import pyarrow.dataset

# Load every file fully into memory as a table.
train_datasets = []
for ds_file in all_datasets:
    ds = pa.dataset.dataset(ds_file, format='feather')
    train_datasets.append(ds.to_table())

# combine_chunks() copies the chunks into contiguous buffers,
# so both the original tables and the combined copy are alive at once.
combined_table = pa.concat_tables(train_datasets).combine_chunks()
batches = combined_table.to_batches()  # a single batch after combine_chunks()

with open(args.output + "{}.arrow".format(split), "wb") as f:
    s = pa.ipc.new_stream(
        f, combined_table.schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
    )
    s.write_batch(batches[0])
    s.close()
```
However, this approach needs roughly 2x the original dataset size in memory. I
wonder if there is a way to write the data file by file
but still end up with a single chunk?
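
For context, here is a rough sketch of the file-by-file approach I have in
mind (assuming the same `all_datasets`, `args.output`, and `split` as above).
It keeps only one file in memory at a time, but as far as I can tell the
resulting stream contains one record batch per input batch rather than a
single chunk:
```python
import pyarrow as pa
import pyarrow.dataset

# Sketch of an incremental write: stream each file's batches as they
# are read, so peak memory is roughly one file rather than the whole
# dataset. Drawback: the output has many record batches, not one chunk.
with open(args.output + "{}.arrow".format(split), "wb") as f:
    writer = None
    for ds_file in all_datasets:
        ds = pa.dataset.dataset(ds_file, format='feather')
        for batch in ds.to_batches():
            if writer is None:
                # Take the schema from the first batch we see.
                writer = pa.ipc.new_stream(
                    f, batch.schema,
                    options=pa.ipc.IpcWriteOptions(allow_64bit=True),
                )
            writer.write_batch(batch)
    if writer is not None:
        writer.close()
```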

Thank you!

-- 

Best,
Kaixiang
