You will want to read the header files in:

https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow

You can see example usage of the C++ API in the Python bindings:

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx
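To give a flavor of the C++ side, here is a rough, untested sketch of
reading a whole Parquet file into an arrow::Table with the
parquet::arrow reader (the exact signatures have been moving between
releases, so treat the headers as the source of truth; the function
name ReadParquetToArrow is just for illustration):

#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

// Reads an entire Parquet file into a single in-memory arrow::Table.
arrow::Status ReadParquetToArrow(const std::string& path,
                                 std::shared_ptr<arrow::Table>* out) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  arrow::Status st = arrow::io::ReadableFile::Open(path, &infile);
  if (!st.ok()) return st;

  std::unique_ptr<parquet::arrow::FileReader> reader;
  st = parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader);
  if (!st.ok()) return st;

  // Decodes all row groups and columns; the FileReader header also has
  // more selective (per-column / per-row-group) read methods.
  return reader->ReadTable(out);
}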

Kouhei Sutou has also created GLib bindings for the Parquet-Arrow
connector:

https://github.com/red-data-tools/parquet-glib/tree/master/parquet-glib

As an in-memory (or memory-mapped) cache layer, Arrow is an ideal fit.
As an example, the Ray project at the Berkeley RISELab is using Arrow,
along with a shared memory object broker they developed called Plasma,
as their data plane for IPC / multiprocess scheduling
(https://github.com/ray-project/ray).

Beyond the API for reading (zero-copy, if possible) and writing the
Arrow IPC format, there are no particular built-in tools for managing
memory lifetime. The RecordBatch and Array data structures are
deliberately minimalist, with the central
std::shared_ptr<arrow::Buffer> construct for reference-counted memory
sharing (see the arrow::SliceBuffer function, which provides zero-copy
slices while maintaining a reference to the parent buffer).
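
For example, slicing a buffer copies no bytes and keeps the parent
alive through the shared_ptr (a minimal, untested sketch):

#include <cstdint>
#include <memory>

#include <arrow/buffer.h>

void SliceExample() {
  // A non-owning Buffer wrapping some existing memory; the caller is
  // responsible for keeping `bytes` alive for the buffer's lifetime.
  static const uint8_t bytes[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  auto parent = std::make_shared<arrow::Buffer>(bytes, sizeof(bytes));

  // Zero-copy: the slice points at parent->data() + 2 and holds a
  // shared_ptr to the parent, so the parent cannot go away while the
  // slice is alive.
  std::shared_ptr<arrow::Buffer> slice = arrow::SliceBuffer(parent, 2, 4);

  // slice->size() == 4, and no bytes were copied.
}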

> If I wanted a cache layer that could leverage Arrow but still be able to
> access data directly from Parquet when it was not loaded into Arrow, is the
> best way to have some kind of manager that will load the data into Arrow when
> it is not available? Or is there some kind of API where Arrow can know, if
> they request this data I need to load it from this Parquet file?

Yes, I would suggest defining a virtual table abstraction where
requests load from Parquet into Arrow if the data has not previously
been decoded, and place the decoded data in some kind of LRU cache. We
are planning to build something similar to this for pandas2.
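
Roughly, such a manager could look like the following (an untested
sketch; TableCache, Entry, and LoadParquet are made-up names, not
anything that ships in Arrow or parquet-cpp, and the eviction policy
here is just a plain count-based LRU for illustration):

#include <list>
#include <memory>
#include <string>
#include <unordered_map>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

// Hands out arrow::Tables keyed by Parquet file path, loading on a cache
// miss and evicting the least-recently-used table when over capacity.
class TableCache {
 public:
  explicit TableCache(size_t max_entries) : max_entries_(max_entries) {}

  arrow::Status Get(const std::string& path,
                    std::shared_ptr<arrow::Table>* out) {
    auto it = entries_.find(path);
    if (it != entries_.end()) {
      // Cache hit: mark as most recently used.
      lru_.splice(lru_.begin(), lru_, it->second.lru_pos);
      *out = it->second.table;
      return arrow::Status::OK();
    }

    // Cache miss: decode the Parquet file into an Arrow table.
    std::shared_ptr<arrow::Table> table;
    arrow::Status st = LoadParquet(path, &table);
    if (!st.ok()) return st;

    lru_.push_front(path);
    entries_[path] = Entry{table, lru_.begin()};

    if (entries_.size() > max_entries_) {
      // Evict the least recently used table; its memory is released
      // once no other shared_ptr references remain.
      entries_.erase(lru_.back());
      lru_.pop_back();
    }
    *out = table;
    return arrow::Status::OK();
  }

 private:
  struct Entry {
    std::shared_ptr<arrow::Table> table;
    std::list<std::string>::iterator lru_pos;
  };

  static arrow::Status LoadParquet(const std::string& path,
                                   std::shared_ptr<arrow::Table>* out) {
    std::shared_ptr<arrow::io::ReadableFile> infile;
    arrow::Status st = arrow::io::ReadableFile::Open(path, &infile);
    if (!st.ok()) return st;
    std::unique_ptr<parquet::arrow::FileReader> reader;
    st = parquet::arrow::OpenFile(infile, arrow::default_memory_pool(),
                                  &reader);
    if (!st.ok()) return st;
    return reader->ReadTable(out);
  }

  size_t max_entries_;
  std::list<std::string> lru_;  // most recently used at the front
  std::unordered_map<std::string, Entry> entries_;
};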

- Wes

On Fri, May 19, 2017 at 1:26 PM, Felipe Aramburu <[email protected]> wrote:
> This is about the C++ API.
> We are changing our underlying storage to be based on Parquet files instead
> of using a proprietary format that we developed. Arrow's integration with
> Parquet makes it attractive for leveraging it as our cache layer, but I am
> having trouble finding much documentation on reading files from Parquet
> into Arrow using the C++ API, and the examples are somewhat limited.
>
> Currently our own memory manager handles things like expiring data when it
> is stale or goes above a threshold, and it has a tightly integrated API with
> our storage layer, i.e. you can request stuff from it even if it has not
> been loaded yet and the cache layer will get that data directly from disk.
>
> Do any utilities exist in Arrow for managing memory consumption and
> releasing information from cache as its consumption increases? Are there
> ways of detecting when some information was last accessed?
>
> If I wanted a cache layer that could leverage Arrow but still be able to
> access data directly from Parquet when it was not loaded into Arrow, is the
> best way to have some kind of manager that will load the data into Arrow
> when it is not available? Or is there some kind of API where Arrow can
> know, if they request this data I need to load it from this Parquet file?
>
> Are there any docs available for the C++ APIs of Arrow and Parquet other
> than what is found at
>
> https://arrow.apache.org/docs/cpp/
> https://github.com/apache/parquet-cpp
>
>
> Felipe Aramburu
