Hi,

Updating the thread for people with a similar use case. A new project called [duckdb](https://github.com/cwida/duckdb) can treat Arrow memory-mapped files as virtual tables, so a lot of the pandas functionality I needed can be covered by its SQL equivalents. DuckDB works equally well with chunked tables, which removes the need for contiguous columns in the Arrow file.
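For anyone who wants to try it, here is a minimal sketch (the file path and column name are made up, and it assumes a duckdb Python client recent enough for `register()` to accept a pyarrow Table):

```python
import duckdb
import pyarrow as pa
import pyarrow.ipc

# Memory-map an IPC random-access file written earlier; read_all() is
# zero-copy and yields a Table with one chunk per record batch.
source = pa.memory_map("data.arrow", "r")  # hypothetical path
table = pa.ipc.open_file(source).read_all()

con = duckdb.connect()
con.register("t", table)  # expose the (chunked) Arrow Table as a SQL view
result = con.execute(
    "SELECT value, COUNT(*) AS n FROM t GROUP BY value"  # "value" is made up
).fetch_arrow_table()
```

The aggregation runs directly against the Arrow buffers, chunked or not, so nothing has to be combined or copied into pandas first.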
Thank you,
Ishan

________________________________
From: Sutou Kouhei <k...@clear-code.com>
Sent: Friday, September 11, 2020 3:23 AM
To: u...@arrow.apache.org <u...@arrow.apache.org>; dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: [Python/C-Glib] writing IPC file format column-by-column

Hi,

I'm adding dev@ because this may need an improvement to Apache Arrow C++.

It seems that we need the following new feature for this use case
(combining chunks under a small memory budget so that large data can be
processed with pandas and mmap):

  * Writing the chunks in an arrow::Table as one large arrow::RecordBatch
    without creating intermediate combined chunks.

The current arrow::ipc::RecordBatchWriter::WriteTable() always splits the
given arrow::Table into one or more arrow::RecordBatch instances. We may
be able to add a feature that writes the given arrow::Table as one
combined arrow::RecordBatch without creating the intermediate combined
chunks in memory.

Do C++ developers have any opinion on this?

Thanks,
--
kou

In <ch2pr20mb30950806b40fe286d414ac97eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "[Python/C-Glib] writing IPC file format column-by-column" on Wed, 9 Sep 2020 10:11:54 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Hi,
>
> I'm looking at using Arrow primarily on low-resource instances with
> out-of-memory datasets. This is the workflow I'm trying to implement:
>
>   * Write record batches in the IPC streaming format to a file from a C
>     runtime.
>   * Consume it one row at a time from Python/C by loading the file in
>     chunks.
>   * If the schema is simple enough to support zero-copy operations, make
>     the table readable from pandas. This requires me to:
>       * convert it into a Table with a single chunk per column (since
>         pandas can't use mmap with chunked arrays), and
>       * write the table in the IPC random-access format to a file.
>
> PyArrow provides a method `combine_chunks` to combine chunks into a
> single chunk. However, it needs to create the entire table in memory (I
> suspect the peak is 2x, since it holds both versions of the table in
> memory, though that could be avoided).
>
> Since the Arrow layout is columnar, I'm curious whether it is possible
> to write the table one column at a time, and whether the existing
> GLib/Python APIs support it. The C++ file writer objects only seem to go
> down to serializing a single record batch at a time, not a single
> column.
>
> Thank you,
> Ishan
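For completeness, here is a sketch in Python of the workaround the quoted messages describe: rewriting a streamed (chunked) IPC file as a single-batch random-access file so pandas can read it from a memory map. Note that `combine_chunks()` materializes the whole table in memory, which is exactly the intermediate cost the proposed C++ feature would remove. File names are hypothetical, and the zero-copy conversion at the end only succeeds for primitive, null-free columns.

```python
import pyarrow as pa
import pyarrow.ipc

# Read the chunked table produced by the C runtime (streaming format).
stream = pa.memory_map("input.arrows", "r")
table = pa.ipc.open_stream(stream).read_all()  # one chunk per record batch

# Collapse every column to a single contiguous chunk -- this is the step
# that currently has to hold the combined table in RAM.
combined = table.combine_chunks()

# Rewrite as an IPC random-access file containing one record batch.
with pa.OSFile("output.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, combined.schema) as writer:
        writer.write_table(combined)

# Later, on the small instance: map the file and hand it to pandas.
mapped = pa.ipc.open_file(pa.memory_map("output.arrow", "r")).read_all()
df = mapped.to_pandas(split_blocks=True, zero_copy_only=True)
```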