Hello Maarten,

In theory, you could provide a custom mmap-based allocator and use the
builder facility. Since the array is still in its "build phase" and not
sealed, it should be fine if mremap changes the pointer address. This
might fail in practice, though, since the allocator is also used for
auxiliary data, e.g. the dictionary hash table in the case of the
Dictionary type.
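
From Python, where none of this allocator machinery is exposed, the
closest equivalent today is to bypass the builders entirely: allocate
the destination mapping yourself, fill it in chunks, and only wrap it
as an Arrow array at the end. A minimal sketch, assuming a
non-nullable float64 column (the file name is just for illustration):

    import numpy as np
    import pyarrow as pa

    n = 1_000_000

    # Preallocate the destination file and memory-map it (zero-filled).
    mm = np.memmap("column.raw", dtype="f8", mode="w+", shape=(n,))

    # Fill the mapping chunk by chunk; writes go through to the file.
    chunk = 65_536
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        mm[start:stop] = np.arange(start, stop, dtype="f8")
    mm.flush()

    # Zero-copy wrap: pa.py_buffer() reuses the mapped memory, and
    # from_buffers() builds a primitive array on top of it. No
    # validity buffer is passed since we know there are no nulls.
    arr = pa.Array.from_buffers(pa.float64(), n,
                                [None, pa.py_buffer(mm)], null_count=0)

The resulting array reads straight from the file mapping, so nothing
is copied into process memory.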


Another solution is to create a `FixedBuilder` class where:
- the number of elements is known
- the data type is of fixed width
- nullability is known (i.e. whether you need the extra validity buffer)

I think sooner or later we'll need such a class.
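
Since all sizes are known before the first value is written, such a
builder could allocate each buffer exactly once and never reallocate,
which also sidesteps the pointer-relocation issue above. As a rough
sketch of what usage might look like (a hypothetical API, nothing like
this exists today):

    import pyarrow as pa

    n = 1_000_000
    t = pa.float64()

    # Exact buffer sizes are computable up front:
    data_bytes = n * t.bit_width // 8   # single fixed-width data buffer
    validity_bytes = (n + 7) // 8       # only needed when nullable

    # Hypothetical builder: one allocation, then in-place writes.
    builder = FixedBuilder(t, length=n, nullable=False)
    for i in range(n):
        builder[i] = float(i)
    arr = builder.finish()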

François

On Tue, Nov 26, 2019 at 10:01 AM Maarten Breddels
<maartenbredd...@gmail.com> wrote:
>
> In vaex I always write the data to hdf5 as 1 large chunk (per column).
> The reason is that it allows the mmapped columns to be exposed as a
> single numpy array (talking numerical data only for now), which many
> people are quite comfortable with.
>
> The strategy for vaex to write unchunked data is to first create an
> 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
> write to them in chunks.
>
> This means that in vaex I need to support mutable data (only used
> internally; vaex's default is immutable data, like Arrow), since I
> need to write to the memory-mapped data. It also makes the exporting
> code relatively simple.
>
> I could not find a way in Arrow to get something similar done, at
> least not one that ends up with a single pa.array instance for each
> column. I think Arrow's mindset is that you should just use chunks,
> right? Or is this also something that can be considered for Arrow?
>
> An alternative would be to implement Arrow on top of hdf5, which I
> basically do now in vaex (with limited support). Again, I'm wondering
> whether there is interest from the Arrow community in storing Arrow
> data in hdf5?
>
> cheers,
>
> Maarten
