Re: Apache Arrow file format

Matt Topol Tue, 17 Oct 2023 19:43:32 -0700

One benefit of the Feather format (i.e. the Arrow IPC file format) is that
you can mmap the file to read sections of a larger-than-memory dataset.
Since, as Felipe mentioned, the format is focused on in-memory
representation, you can mmap the file and use the raw bytes directly. For a
large file that you only want to read sections of, this can be beneficial
for IO and memory usage.

Unfortunately, you are correct that it doesn't allow for easy column
projection (you're going to read all the columns for a record batch in the
file, no matter what). So it's going to be a trade-off based on your needs
as to whether it makes sense, or if you should use a file format like
Parquet instead.

-Matt

On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <felipe...@gmail.com>
wrote:

> It’s not the best since the format is really focused on in-memory
> representation and direct computation, but you can do it:
>
> https://arrow.apache.org/docs/python/feather.html
>
> —
> Felipe
>
> On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
>
> > Hi,
> >
> > Is it a good idea to use Apache Arrow as a file format? Looks like
> > projecting columns isn't available by default.
> >
> > One of the benefits of the Parquet file format is column projection,
> > where the IO is limited to just the columns projected.
> >
> > Regards,
> > Nara
> >
>
