Hi

Updating the thread for people with a similar use case. A new project called 
[duckdb](https://github.com/cwida/duckdb) allows using Arrow memory-mapped 
files as virtual tables, so a lot of pandas functionality can be covered by 
its SQL equivalents. DuckDB works equally well with chunked tables, which 
alleviates the need for contiguous columns in the Arrow file.
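
A minimal sketch of that, assuming a file "data.arrow" in the IPC 
random-access format with a numeric column "x" (both names are 
illustrative) and a recent duckdb Python build:

```python
import duckdb
import pyarrow as pa
import pyarrow.ipc as ipc

# Memory-map the file; read_all() is then zero-copy, so the (possibly
# chunked) table does not have to fit in RAM.
source = pa.memory_map("data.arrow", "r")
tbl = ipc.open_file(source).read_all()

# DuckDB's replacement scan lets SQL refer to the pyarrow.Table through
# its Python variable name, chunked columns included.
con = duckdb.connect()
print(con.execute("SELECT count(*), avg(x) FROM tbl").fetchall())
```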

Thank you,
Ishan
________________________________
From: Sutou Kouhei <k...@clear-code.com>
Sent: Friday, September 11, 2020 3:23 AM
To: u...@arrow.apache.org <u...@arrow.apache.org>; dev@arrow.apache.org 
<dev@arrow.apache.org>
Subject: Re: [Python/C-Glib] writing IPC file format column-by-column

Hi,

I'm adding dev@ because this may require improvements to Apache Arrow C++.

It seems that we need the following new feature for this use
case (processing large data with pandas and mmap on machines
with little memory):

  * Writing the chunks in an arrow::Table as one large
    arrow::RecordBatch without creating an intermediate
    combined table in memory

The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatch instances. We may be able to add a
feature that writes the given arrow::Table as one combined
arrow::RecordBatch without creating intermediate combined
chunks.
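
For example, here is a minimal pyarrow sketch of the current behavior
(file and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# A table whose only column consists of three chunks.
chunked = pa.table({"x": pa.chunked_array([[1, 2], [3, 4], [5, 6]])})

with pa.OSFile("chunked.arrow", "wb") as sink:
    with ipc.new_file(sink, chunked.schema) as writer:
        writer.write_table(chunked)  # one record batch per chunk

print(ipc.open_file("chunked.arrow").num_record_batches)  # -> 3

# Getting a single record batch currently requires materializing the
# combined table in memory first, which is what we want to avoid.
combined = chunked.combine_chunks()
with pa.OSFile("combined.arrow", "wb") as sink:
    with ipc.new_file(sink, combined.schema) as writer:
        writer.write_table(combined)  # -> one record batch
```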


Do C++ developers have any opinion on this?


Thanks,
--
kou

In <ch2pr20mb30950806b40fe286d414ac97eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "[Python/C-Glib] writing IPC file format column-by-column" on Wed, 9 Sep 2020 10:11:54 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Hi
>
> I'm looking at using Arrow primarily on low-resource instances with 
> larger-than-memory datasets. This is the workflow I'm trying to implement.
>
>
>   *   Write record batches in IPC streaming format to a file from a C runtime.
>   *   Consume them one row at a time from Python/C by loading the file in 
> chunks.
>   *   If the schema is simple enough to support zero-copy operations, make 
> the table readable from pandas (see the sketch after this list). This 
> requires me to:
>      *   convert the data into a Table with a single chunk per column (since 
> pandas can't use mmap with chunked arrays), and
>      *   write the table in IPC random-access format to a file.
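>
> For concreteness, a minimal sketch of that workflow (file names are
> illustrative, and step 2 still materializes the combined table in
> memory, which is exactly what I'd like to avoid):
>
> ```python
> import pyarrow as pa
> import pyarrow.ipc as ipc
>
> # 1. Read the stream written by the C runtime, zero-copy via mmap.
> source = pa.memory_map("stream.arrows", "r")
> table = ipc.open_stream(source).read_all()
>
> # 2. Collapse each column to a single chunk -- in-memory today.
> table = table.combine_chunks()
>
> # 3. Write the IPC random-access (file) format.
> with pa.OSFile("random_access.arrow", "wb") as sink:
>     with ipc.new_file(sink, table.schema) as writer:
>         writer.write_table(table)
>
> # 4. Memory-map the result; for simple enough schemas (single chunk,
> #    primitive types, no nulls) pandas can view the buffers without
> #    copying.
> mapped = ipc.open_file(pa.memory_map("random_access.arrow", "r")).read_all()
> df = mapped.to_pandas(split_blocks=True, zero_copy_only=True)
> ```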
>
> PyArrow provides a method `combine_chunks` to merge each column's chunks into 
> a single chunk. However, it needs to materialize the entire table in memory 
> (I suspect the peak is 2x, since it holds both versions of the table in 
> memory at once, though that could be avoided).
>
> Since the Arrow layout is columnar, I'm curious whether it is possible to 
> write the table one column at a time, and whether the existing GLib/Python 
> APIs support that. The C++ file writer objects seem to bottom out at 
> serializing a single record batch at a time, not a single column.
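>
> For reference, this is the finest write granularity I can find in the
> Python API today (the GLib bindings appear to mirror it):
>
> ```python
> import pyarrow as pa
> import pyarrow.ipc as ipc
>
> batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
> with pa.OSFile("batches.arrow", "wb") as sink:
>     with ipc.new_file(sink, batch.schema) as writer:
>         # A whole record batch at a time; there is no per-column
>         # variant of this call as far as I can tell.
>         writer.write_batch(batch)
> ```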
>
>
> Thank you,
> Ishan
