Fwd: Apache Arrow at JupyterCon

Gang(Gary) Wang Wed, 06 Sep 2017 21:59:06 -0700

I am forwarding this discussion thread here for your information. Please
join if you are also interested in this topic.



---------- Forwarded message ----------
From: Jacques Nadeau <jacq...@apache.org>
Date: Wed, Sep 6, 2017 at 9:11 PM
Subject: Re: Apache Arrow at JupyterCon
To: d...@arrow.apache.org


This is an interesting problem but also pretty complex. Arrow's Java memory
management model is complex on purpose (see
https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/README.md
for more info). It is designed to reserve and share memory in multiple
hierarchical domains (with reservations and limits) while providing
transfer semantics across those domains with minimal contention and
locking. An opaque (and potentially easy) starting point would be to
optionally allow AllocationManager to use something other than the
PooledByteBufAllocatorL and UnsafeDirectLittleEndian for memory allocation.
This wouldn't expose movement between different memory tiers but that could
be managed underneath the Arrow system. At the end of the day, the whole
hierarchy is basically a collection of memory addresses, accounting and
reference counting.
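
To make the "memory addresses, accounting and reference counting" view
concrete, here is a minimal sketch of hierarchical accounting with limits,
reference counting, and a transfer between domains. The class and method
names (AccountingDomain, CountedBlock, reserve, transferTo) are hypothetical
and are not Arrow's actual allocator API. No real memory is allocated, only
the bookkeeping is shown, and coarse locks are used for brevity where the
real design aims for minimal locking.

```java
// Hypothetical sketch: these classes are illustrative only and are not
// part of Arrow's Java memory module.
final class AccountingDomain {
    private final AccountingDomain parent; // null for the root domain
    private final long limit;              // maximum bytes this domain may hold
    private long used;                     // bytes currently charged here

    AccountingDomain(AccountingDomain parent, long limit) {
        this.parent = parent;
        this.limit = limit;
    }

    AccountingDomain newChild(long childLimit) {
        return new AccountingDomain(this, childLimit);
    }

    /** Charges bytes to this domain and all of its ancestors, or to none. */
    synchronized boolean reserve(long bytes) {
        if (used + bytes > limit) return false;
        if (parent != null && !parent.reserve(bytes)) return false;
        used += bytes;
        return true;
    }

    /** Returns bytes to this domain and all of its ancestors. */
    synchronized void release(long bytes) {
        used -= bytes;
        if (parent != null) parent.release(bytes);
    }

    synchronized long used() {
        return used;
    }
}

final class CountedBlock {
    private AccountingDomain owner; // domain currently charged for this block
    private final long size;
    private int refCount = 1;

    private CountedBlock(AccountingDomain owner, long size) {
        this.owner = owner;
        this.size = size;
    }

    static CountedBlock allocate(AccountingDomain domain, long size) {
        if (!domain.reserve(size)) {
            throw new IllegalStateException("domain limit exceeded");
        }
        return new CountedBlock(domain, size);
    }

    synchronized void retain() {
        if (refCount == 0) throw new IllegalStateException("already freed");
        refCount++;
    }

    /** Drops one reference; returns true when the block was freed. */
    synchronized boolean release() {
        if (refCount == 0) throw new IllegalStateException("already freed");
        if (--refCount > 0) return false;
        owner.release(size);
        return true;
    }

    /**
     * Moves the accounting charge to another domain. Ancestors shared by
     * both domains are briefly charged twice, so a transfer can fail near
     * a shared limit even though the net change is zero.
     */
    synchronized boolean transferTo(AccountingDomain target) {
        if (refCount == 0) throw new IllegalStateException("already freed");
        if (!target.reserve(size)) return false;
        owner.release(size);
        owner = target;
        return true;
    }
}
```

In this sketch a block allocated in one child domain counts against both
that child and the root, and transferTo moves the charge to a sibling
without touching the bytes themselves.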

A phase two could be a proposal which allows movement between memory
domains and could be generified across systems like Mnemonic as well as
GPU/Device memory domains.


On Wed, Sep 6, 2017 at 4:45 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> Thanks Gary, that is helpful context. In light of this, it might be
> worth writing some kind of a proposal for how to enable the Java
> vector classes to be backed by some other kind of byte buffers. It
> might be that an alternative version of portions of the Arrow Java
> library (i.e. decoupled from Netty) might need to be created.
>
> If it cannot be reconciled with the Netty AbstractByteBuf class then
> this would be useful to know so that Arrow developers can plan
> accordingly for the future.
>
> On Wed, Sep 6, 2017 at 2:15 PM, Gary Wong <qich...@gmail.com> wrote:
> > ArrowBuf inherits from AbstractByteBuf, which is defined in the Netty
> > library and works more like a memory pool than a pure buffer. That is
> > why ArrowBuf is not backed by ByteBuffer for now.
> >
> > I once tried to build ArrowBuf on top of Mnemonic's DurableBuffer, but
> > it does not look easy to decouple the refcount from the other parts,
> > because the lifecycle of a DurableBuffer can also be managed
> > automatically by the JVM instead of by refcount.
> >
> > I still want to figure out how to gracefully migrate the backend of
> > ArrowBuf from Netty to Mnemonic. In addition, DurableBuffer could bring
> > other benefits to Arrow, e.g. persistence on any kind of memory service
> > that can make use of SSD, NVMe, memory, NAS and more. In this way, Arrow
> > would be able to break through the capacity limit of system memory,
> > avoid SerDe for storage, and link to other durable objects with ease.
> >
> >
> >
> >
> > On Wed, Sep 6, 2017 at 10:40 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> It should be possible to have an ArrowBuf backed by a
> >> MappedByteBuffer. Anyone reading is welcome to dig in and write a
> >> patch for this.
> >>
> >> Semantically this is what we have done in C++ -- a memory map inherits
> >> from arrow::Buffer, so we can slice and dice a memory map as we would
> >> any other Buffer object
> >>
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L501
> >>
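
As a rough illustration of the memory-map idea, the sketch below maps a
file with the plain JDK and takes a zero-copy slice of the mapping. It does
not touch ArrowBuf or Netty. The class name MappedSliceSketch and its two
helper methods are hypothetical, and little-endian byte order is chosen
here only because Arrow buffers are little-endian.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: JDK-only, not an Arrow or Netty class.
final class MappedSliceSketch {

    /** Maps the first `size` bytes of `file` read-write, growing it if needed. */
    static MappedByteBuffer map(Path file, int size) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE,
                StandardOpenOption.CREATE)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buffer.order(ByteOrder.LITTLE_ENDIAN);
            // The mapping remains valid after the channel is closed.
            return buffer;
        }
    }

    /** Returns a view over [offset, offset + length) that shares the same memory. */
    static ByteBuffer slice(ByteBuffer source, int offset, int length) {
        // duplicate() keeps the caller's position and limit untouched.
        ByteBuffer view = source.duplicate();
        view.limit(offset + length);
        view.position(offset);
        // slice() resets the byte order to big-endian, so set it again.
        return view.slice().order(ByteOrder.LITTLE_ENDIAN);
    }
}
```

Writes made through the slice are visible through the original mapping
because both views share the same mapped region. Making an ArrowBuf sit on
top of such a mapping would still need the reference counting and allocator
accounting discussed elsewhere in this thread.
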
> >> On Mon, Sep 4, 2017 at 4:05 AM, Gonzalo Ortiz Jaureguizar
> >> <golthir...@gmail.com> wrote:
> >> > This is a very interesting feature. It's very surprising that there is
> >> > no ByteBuffer implementation backed by a MappedByteBuffer. As far as I
> >> > understand, it should be trivial to implement (maybe not to pool) as
> >> > usually ByteBuf is backed by a ByteBuffer and MappedByteBuffer extends
> >> > that. But I didn't find implementations when I googled for it.
> >> >
> >> > 2017-09-03 16:12 GMT+02:00 Wes McKinney <wesmck...@gmail.com>:
> >> >
> >> >> I think ideally we would have a Java interface that would support
> >> >> all of:
> >> >>
> >> >> - Memory mapped files
> >> >> - Anonymous shared memory segments (e.g. POSIX shm)
> >> >> - NVM / Mnemonic
> >> >>
> >> >> We already have the ability to do zero-copy reads from buffer-like
> >> >> objects in C++ and IO interfaces that support zero copy (like memory
> >> >> mapped files). We can do zero-copy reads from ArrowBuf in Java but we
> >> >> are missing the interfaces to shared memory sources.
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Thu, Aug 31, 2017 at 5:09 PM, Gang(Gary) Wang <ga...@apache.org> wrote:
> >> >> > Hi Wes,
> >> >> >
> >> >> > Thank you for the explanation. The usage described in
> >> >> > https://issues.apache.org/jira/browse/ARROW-721 could be directly
> >> >> > supported by Mnemonic through DurableBuffer and DurableChunk. The
> >> >> > DurableChunk makes use of unsafe to expose a plain memory space for
> >> >> > Arrow to use without performance penalties; that is why most of the
> >> >> > big data frameworks take advantage of unsafe. Please refer to
> >> >> > https://mnemonic.apache.org/docs/domusecases.html for the use
> >> >> > cases. We could work on this ticket if you think that's exactly
> >> >> > what you want.
> >> >> >
> >> >> > Regarding the NVM tech, that is what Mnemonic was created for. It
> >> >> > can be used to directly persist generic Java objects and
> >> >> > collections on NVM with no SerDe. So which basic tools did you
> >> >> > have in mind? We can probably also help identify the gaps in
> >> >> > Mnemonic. Thanks!
> >> >> >
> >> >> > Very truly yours,
> >> >> > Gary
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Thu, Aug 31, 2017 at 12:32 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> >
> >> >> >> hi Gary,
> >> >> >>
> >> >> >> The Java libraries are not yet capable of writing or zero-copy
> >> >> >> reads of Arrow datasets to/from shared memory or memory-mapped
> >> >> >> files: https://issues.apache.org/jira/browse/ARROW-721. We've
> >> >> >> developed quite a bit of technology on the C++ side for dealing
> >> >> >> with shared memory IPC but we need someone to help with that on
> >> >> >> the Java side.
> >> >> >>
> >> >> >> In the context of NVM technologies, it would be nice to be able to
> >> >> >> persist a dataset to NVM and continue to do analytics on it, while
> >> >> >> retaining a "handle" so that the dataset can be easily recovered
> >> >> >> in the event of process failure. We may arrive at new use cases
> >> >> >> once some of the basic tools exist.
> >> >> >>
> >> >> >> - Wes
> >> >> >>
> >> >> >> On Wed, Aug 30, 2017 at 6:19 PM, Gang(Gary) Wang <ga...@apache.org> wrote:
> >> >> >> > Thank you for sharing the videos. We are very interested in
> >> >> >> > supporting the Arrow data format and collections closely. Could
> >> >> >> > you please point out which interfaces would allow Mnemonic to
> >> >> >> > act as a memory provider, so that users can store and access
> >> >> >> > Arrow-managed datasets? Thanks!
> >> >> >> >
> >> >> >> > Very truly yours,
> >> >> >> > Gary.
> >> >> >> >
> >> >> >> >
> >> >> >> > On Wed, Aug 30, 2017 at 2:11 PM, Ivan Sadikov <ivan.sadi...@gmail.com> wrote:
> >> >> >> >
> >> >> >> >> Great presentation! Thank you for sharing.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Thu, 31 Aug 2017 at 8:02 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> > Absolutely. I will do that now
> >> >> >> >> >
> >> >> >> >> > On Wed, Aug 30, 2017 at 3:33 PM, Julian Hyde <jh...@apache.org> wrote:
> >> >> >> >> > > Thanks for sharing. Can we tweet those videos as well? I see
> >> >> >> >> > > that https://twitter.com/apachearrow only tweeted your slides.
> >> >> >> >> > >
> >> >> >> >> > >> On Aug 26, 2017, at 1:11 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> >> >> > >>
> >> >> >> >> > >> hi all,
> >> >> >> >> > >>
> >> >> >> >> > >> In case folks here are interested, I gave a keynote this week
> >> >> >> >> > >> at JupyterCon explaining my motivations for being involved in
> >> >> >> >> > >> Apache Arrow and how I see it fitting in with the data science
> >> >> >> >> > >> ecosystem long term:
> >> >> >> >> > >>
> >> >> >> >> > >> https://www.youtube.com/watch?v=wdmf1msbtVs
> >> >> >> >> > >>
> >> >> >> >> > >> I also gave an interview going a little deeper into some of
> >> >> >> >> > >> the topics from the talk:
> >> >> >> >> > >>
> >> >> >> >> > >> https://www.youtube.com/watch?v=Q7y9l-L8yiU
> >> >> >> >> > >>
> >> >> >> >> > >> I believe we have an exciting journey ahead of us, but it's
> >> >> >> >> > >> certainly going to take a lot of collaboration and community
> >> >> >> >> > >> development.
> >> >> >> >> > >>
> >> >> >> >> > >> - Wes
> >> >> >> >> > >
> >> >> >> >> >
> >> >> >> >>
> >> >> >>
> >> >>
> >>
>
