On Tue, Dec 17, 2019 at 5:15 AM Maarten Breddels
<maartenbredd...@gmail.com> wrote:
>
> Hi,
>
> I had to catch up a bit with the Arrow documentation before I could respond
> properly. My fear was that Arrow demanded that the in-memory representation
> always be 'packed', or 'flat'. After going through the docs, it seems the
> data is written in this form only when doing IPC or stream writing. But it
> seems that e.g. ChunkedArray and StructArray can have their arrays/fields
> anywhere in memory. So given a set of contiguous (ignoring chunking)
> datasets in an hdf5 file, we should be able to memory-map them and pass
> them to an Apache Arrow Table without any memory copy. Is this assumption
> correct?

Correct, yes.
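
For illustration, a minimal sketch of that path with h5py, numpy, and
pyarrow; the file and dataset names are hypothetical, and it assumes a
contiguous (non-chunked), non-nullable numeric dataset:

    import h5py
    import numpy as np
    import pyarrow as pa

    # Hypothetical layout; get_offset() returns the dataset's byte offset
    # in the file (None unless the dataset is contiguous and allocated).
    with h5py.File("data.hdf5", "r") as f:
        ds = f["/table/columns/x/data"]
        offset = ds.id.get_offset()
        dtype, length = ds.dtype, ds.shape[0]

    # Memory-map the raw bytes of the dataset; the values are never copied.
    mapped = np.memmap("data.hdf5", dtype=dtype, mode="r",
                       offset=offset, shape=(length,))

    # Wrap the mapped memory as an Arrow buffer and assemble a Table.
    buf = pa.py_buffer(mapped)
    arr = pa.Array.from_buffers(pa.from_numpy_dtype(dtype), length, [None, buf])
    table = pa.table({"x": arr})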

>
> Indeed, assuming the story above is correct, Apache Arrow and hdf5 are
> completely orthogonal.

Yes, that's right.

>
> As mentioned before, I have a strong preference for having the ability to
> write out a vaex DataFrame of any size as a single chunk. Having this in
> hdf5 will make the data trivial to read for anyone using an hdf5 library,
> who can then map the data to a single numpy array.

This would seem only relevant to numeric data with no nulls. I'm sure
there are problem domains where many datasets are like this, but in
the world of business analytics and databases it seems more the
exception. NumPy arrays were always a poor fit as a backing data
structure for analytics, but in 2008/2009 having NumPy
interoperability was important. It seems a great deal less important
to me now, particularly with regard to non-numeric, nullable (of any
type), categorical, or nested data.
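
To make the nulls point concrete: a nullable Arrow array carries a
validity bitmap alongside its values, so there is no single NumPy array
it can map onto without copying. A small sketch:

    import pyarrow as pa

    arr = pa.array([1, 2, None], type=pa.int64())

    # Two buffers: a validity bitmap plus the values. A plain NumPy array
    # has nowhere to put the bitmap, so conversion has to copy.
    validity, values = arr.buffers()
    print(arr.null_count)                      # 1
    print(arr.to_numpy(zero_copy_only=False))  # copies; nulls surface as NaN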

> I will probably explore this idea more within Vaex, in Python. I'll try to
> keep in touch with this, and would happily see this go into Apache Arrow. I
> hope we can formalize an intuitive way to write Apache Arrow Tables into
> hdf5.

I agree it would be useful; I will keep an eye out for what you learn.

> cheers,
>
> Maarten
>
>
>
>
>
>
> Op wo 27 nov. 2019 om 22:35 schreef Wes McKinney <wesmck...@gmail.com>:
>
> > hi,
> >
> > There have been a number of discussions over the years about on-disk
> > pre-allocation strategies. No volunteers have implemented anything,
> > though. Developing an HDF5 integration library with pre-allocation and
> > buffer management utilities seems like a reasonable growth area for
> > the project. The functionality provided by HDF5 and Apache Arrow (and
> > whether they're doing the same things -- which they aren't) has
> > actually been a common point of confusion for onlookers, so
> > clarifying that the two can work together might be helpful.
> >
> > Both in C++ and Python we have methods for assembling arrays and
> > record batches from mutable buffers, so if you allocate the buffers,
> > populate them, then you can assemble a record batch or table from them
> > in a straightforward manner.
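
In pyarrow that can look roughly like this (a sketch: fixed-width type,
no nulls, names illustrative):

    import numpy as np
    import pyarrow as pa

    length = 1_000_000

    # Allocate a mutable Arrow buffer up front, then populate it in place
    # through a writable NumPy view.
    buf = pa.allocate_buffer(length * 8)
    view = np.frombuffer(buf, dtype=np.int64)
    view[:] = np.arange(length)

    # Assemble an immutable Array / RecordBatch from the buffer, no copy.
    arr = pa.Array.from_buffers(pa.int64(), length, [None, buf])
    batch = pa.RecordBatch.from_arrays([arr], ["x"])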
> >
> > - Wes
> >
> >
> > On Tue, Nov 26, 2019 at 10:25 AM Francois Saint-Jacques
> > <fsaintjacq...@gmail.com> wrote:
> > >
> > > Hello Maarten,
> > >
> > > In theory, you could provide a custom mmap-allocator and use the
> > > builder facility. Since the array is still in "build-phase" and not
> > > sealed, it should be fine if mremap changes the pointer address. This
> > > might fail in practice since the allocator is also used for auxiliary
> > > data, e.g. dictionary hash table data in the case of Dictionary type.
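
For illustration, here is that build-then-seal pattern sketched in
Python, with a plain anonymous mmap standing in for the custom allocator
(pyarrow does not expose custom allocators in Python; sizes are
arbitrary):

    import mmap
    import numpy as np
    import pyarrow as pa

    # Grow an anonymous memory map during the "build phase"; mremap may
    # move the address, which is fine because nothing is sealed yet.
    mm = mmap.mmap(-1, 8 * 1024)
    written = 0
    for chunk in np.array_split(np.arange(1_000_000, dtype="f8"), 10):
        needed = (written + len(chunk)) * 8
        if needed > len(mm):
            mm.resize(max(needed, 2 * len(mm)))
        view = np.frombuffer(mm, dtype="f8", count=needed // 8)
        view[written:written + len(chunk)] = chunk
        written += len(chunk)
        del view  # release the buffer export so the next resize is allowed

    # Seal only the written region into an immutable Arrow array.
    buf = pa.py_buffer(mm).slice(0, written * 8)
    arr = pa.Array.from_buffers(pa.float64(), written, [None, buf])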
> > >
> > >
> > > Another solution is to create a `FixedBuilder` class where:
> > > - the number of elements is known
> > > - the data type is of fixed width
> > > - nullability is known (whether you need an extra buffer)
> > >
> > > I think sooner or later we'll need such a class.
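
Something along these lines, sketched in Python just to pin the idea
down (the class and its API are hypothetical):

    import numpy as np
    import pyarrow as pa

    class FixedBuilder:
        # Hypothetical: length, fixed-width type, and nullability are all
        # known up front, so every buffer is allocated exactly once.
        def __init__(self, type_, length, nullable=False):
            self.type, self.length = type_, length
            self.values = pa.allocate_buffer(length * type_.bit_width // 8)
            # Validity bits (if any) are left for the caller to fill.
            self.validity = pa.allocate_buffer((length + 7) // 8) if nullable else None
            self._view = np.frombuffer(self.values, dtype=type_.to_pandas_dtype())

        def write(self, start, values):
            self._view[start:start + len(values)] = values

        def finish(self):
            return pa.Array.from_buffers(self.type, self.length,
                                         [self.validity, self.values])

    builder = FixedBuilder(pa.int64(), 4)
    builder.write(0, [1, 2, 3, 4])
    print(builder.finish())  # [1, 2, 3, 4]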
> > >
> > > François
> > >
> > > On Tue, Nov 26, 2019 at 10:01 AM Maarten Breddels
> > > <maartenbredd...@gmail.com> wrote:
> > > >
> > > > In vaex I always write the data to hdf5 as 1 large chunk (per column).
> > > > The reason is that it allows the mmapped columns to be exposed as a
> > > > single numpy array (talking numerical data only for now), which many
> > > > people are quite comfortable with.
> > > >
> > > > The strategy for vaex to write unchunked data is to first create an
> > > > 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
> > > > write to them in chunks.
> > > >
> > > > This means that in vaex I need to support mutable data (only used
> > > > internally; vaex's default is immutable data, like Arrow), since I
> > > > need to write to the memory-mapped data. It also makes the exporting
> > > > code relatively simple.
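
In h5py terms that pattern is roughly the following (file name, sizes,
and data are hypothetical):

    import h5py
    import numpy as np

    n = 100_000_000  # hypothetical row count

    # Create a full-size, contiguous dataset up front. Touching one element
    # forces HDF5 to actually allocate the extent on disk.
    with h5py.File("output.hdf5", "w") as f:
        ds = f.create_dataset("x", shape=(n,), dtype="f8")
        ds[0] = 0.0
        offset = ds.id.get_offset()

    # Memory-map the column and fill it in chunks; the OS pages the writes
    # out to disk, so the whole column never has to fit in RAM.
    column = np.memmap("output.hdf5", dtype="f8", mode="r+",
                       offset=offset, shape=(n,))
    step = 1_000_000
    for start in range(0, n, step):
        stop = min(start + step, n)
        column[start:stop] = np.random.random(stop - start)
    column.flush()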
> > > >
> > > > I could not find a way in Arrow to get something similar done, at
> > > > least not without having a single pa.array instance for each column. I
> > > > think Arrow's mindset is that you should just use chunks, right? Or is
> > > > this also something that could be considered for Arrow?
> > > >
> > > > An alternative would be to implement Arrow in hdf5, which I basically
> > > > do now in vaex (with limited support). Again, I'm wondering whether
> > > > there is interest from the Arrow community in storing Arrow data in
> > > > hdf5.
> > > >
> > > > cheers,
> > > >
> > > > Maarten
> >
