Take a look at the Plasma object store: https://arrow.apache.org/docs/python/plasma.html.
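As a minimal sketch of the Plasma workflow (this assumes a store has already
been started, e.g. with "plasma_store -m 1000000000 -s /tmp/plasma", and uses
pyarrow's Plasma client API):

    import numpy as np
    import pandas as pd
    import pyarrow.plasma as plasma

    # Connect to the running store. Note: older pyarrow releases also took a
    # manager socket and release delay: plasma.connect("/tmp/plasma", "", 0).
    client = plasma.connect("/tmp/plasma")

    df = pd.DataFrame({"x": np.arange(1000), "y": np.random.randn(1000)})

    # put() serializes the DataFrame into shared memory and returns an
    # ObjectID, which is cheap to hand to other processes.
    object_id = client.put(df)

    # Any process connected to the same store can fetch the object; the
    # underlying Arrow buffers live in shared memory rather than being
    # copied into each process.
    df_again = client.get(object_id)

In a multiprocessing setup you would put each chunk once and pass only the
ObjectIDs to the worker processes.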
Here's an example using it (along with multiprocessing) to sort a pandas
DataFrame:
https://github.com/apache/arrow/blob/master/python/examples/plasma/sorting/sort_df.py.
It's possible the example is a bit out of date.

You may also be interested in taking a look at Ray:
https://github.com/ray-project/ray. We use Plasma/Arrow under the hood to do
all of these things but hide a lot of the bookkeeping (like object ID
generation). For your setting, you can think of it as a replacement for
Python multiprocessing that automatically uses shared memory and Arrow for
serialization; a sketch of what that looks like follows below the quoted
message.

On Wed, May 16, 2018 at 10:02 AM Corey Nolet <cjno...@gmail.com> wrote:

> I've been reading through the PyArrow documentation and trying to
> understand how to use the tool effectively for IPC (using zero-copy).
>
> I'm on a system with 586 cores & 1 TB of RAM. I'm using pandas DataFrames
> to process several tens of gigabytes of data in memory, and the pickling
> done by Python's multiprocessing API is very wasteful.
>
> I'm running a little hand-built map-reduce where I chunk the DataFrame
> into N_mappers chunks, run some processing on them, then run some number
> N_reducers to finalize the operation. What I'd like to be able to do is
> chunk up the DataFrame into Arrow Buffer objects and just have each mapped
> task read its respective Buffer object with the guarantee of zero-copy.
>
> I see there are a couple of Filesystem abstractions for doing
> memory-mapped files. Durability isn't something I need, and I'm willing to
> forgo the expense of putting the files on disk.
>
> Is it possible to write the data directly to memory and pass just the
> reference around to the different processes? What's the recommended way to
> accomplish my goal here?
>
> Thanks in advance!
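To make the Ray suggestion concrete, here is a rough sketch of the map-reduce
pattern described above (the map and reduce bodies are placeholders for your
actual processing, not code from the example linked earlier):

    import numpy as np
    import pandas as pd
    import ray

    ray.init()  # starts a local shared-memory object store

    @ray.remote
    def map_chunk(chunk):
        # Placeholder "map" step: summarize a chunk of the DataFrame.
        return chunk.sum()

    @ray.remote
    def reduce_results(*partials):
        # Placeholder "reduce" step: combine the partial results.
        return sum(partials)

    df = pd.DataFrame(np.random.randn(100000, 4))
    chunks = np.array_split(df, 8)

    # Each chunk is written to the object store once; the remote tasks
    # receive object IDs and read the data without per-task pickling.
    partials = [map_chunk.remote(c) for c in chunks]
    result = ray.get(reduce_results.remote(*partials))

Object ID generation, serialization, and scheduling across your cores are all
handled for you, which is most of the bookkeeping the hand-rolled
multiprocessing version has to manage itself.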