If you are looking for a more formal discussion and empirical analysis of
the differences, I suggest reading "A Deep Dive into Common Open Formats
for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!)
that compares and contrasts the Arrow, Parquet, ORC, and Feather file
formats.

[1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> To further what others have already mentioned, the IPC file format is
> primarily optimised for IPC use cases, that is, exchanging the entire
> contents between processes. It is relatively inexpensive to encode and
> decode, and it supports all Arrow data types, making it ideal for things
> like spill-to-disk processing, distributed shuffles, etc.
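>
> A minimal pyarrow sketch of that spill-to-disk round trip (the file name
> and data here are just placeholders):
>
>     import pyarrow as pa
>
>     # Build a small table and spill it to an Arrow IPC file.
>     table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
>     with pa.OSFile("spill.arrow", "wb") as sink:
>         with pa.ipc.new_file(sink, table.schema) as writer:
>             writer.write_table(table)
>
>     # Reading it back is cheap: the batches are stored in the wire format.
>     with pa.OSFile("spill.arrow", "rb") as source:
>         restored = pa.ipc.open_file(source).read_all()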
>
> Parquet by comparison is a storage format, optimised for space
> efficiency and selective querying, with [1] containing an overview of
> the various techniques the format affords. It is comparatively expensive
> to encode and decode, and instead relies on index structures and
> statistics to accelerate access.
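>
> As a rough pyarrow illustration (the file name and column names are
> hypothetical), column projection means only the requested columns are
> fetched:
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.table({"id": [1, 2, 3], "payload": ["x", "y", "z"]})
>     pq.write_table(table, "data.parquet")
>
>     # Only the 'id' column chunks are read; 'payload' is skipped.
>     ids = pq.read_table("data.parquet", columns=["id"])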
>
> Both are therefore perfectly viable options depending on your particular
> use-case.
>
> [1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>
> On 18/10/2023 13:59, Dewey Dunnington wrote:
> > Plenty of opinions here already, but I happen to think that IPC
> > streams and/or Arrow File/Feather are wildly underutilized. For the
> > use case where you're mostly just going to read an entire file into R
> > or Python, it's a bit faster (and far superior to CSV, pickling, or
> > .rds files in R).
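> >
> > As a small sketch in Python (the file name and data are invented), the
> > whole-file read is a single call:
> >
> >     import pyarrow as pa
> >     import pyarrow.feather as feather
> >
> >     table = pa.table({"x": [1.0, 2.0, 3.0]})
> >     feather.write_feather(table, "data.feather")
> >
> >     # One call reads the whole file back into memory.
> >     round_tripped = feather.read_table("data.feather")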
> >
> >> you're going to read all the columns for a record batch in the file,
> >> no matter what
> > The metadata for every column in every record batch has to be read,
> > but there's nothing inherent about the format that prevents
> > selectively loading into memory only the required buffers. (I don't
> > know off the top of my head if any reader implementation actually
> > does this.)
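> >
> > That said, at the API level pyarrow's Feather reader does accept a
> > column selection (the column name below is hypothetical); whether the
> > untouched buffers are actually skipped on disk depends on the
> > implementation:
> >
> >     import pyarrow.feather as feather
> >
> >     # Only the requested column ends up in the returned table; with a
> >     # memory-mapped file, untouched buffers need not be paged in.
> >     subset = feather.read_table("data.feather", columns=["x"])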
> >
> >> On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
> >> The Arrow IPC file format is great; it focuses on the in-memory
> >> representation and direct computation. It supports compression and
> >> dictionary encoding, and the file can be zero-copy deserialized into
> >> the in-memory Arrow format.
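> >>
> >> For example (a rough sketch; the file name is a placeholder), both
> >> features can be requested when writing:
> >>
> >>     import pyarrow as pa
> >>
> >>     # Dictionary-encode a column and compress buffers with zstd. Note
> >>     # that zero-copy reads only apply to uncompressed buffers.
> >>     table = pa.table({"tag": pa.array(["a", "b", "a"]).dictionary_encode()})
> >>     options = pa.ipc.IpcWriteOptions(compression="zstd")
> >>     with pa.OSFile("data.arrow", "wb") as sink:
> >>         with pa.ipc.new_file(sink, table.schema, options=options) as writer:
> >>             writer.write_table(table)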
> >>
> >> Parquet provides some strong functionality, like statistics, which
> >> can help prune unnecessary data during scanning and avoid CPU and IO
> >> cost. It also has highly efficient encodings, which can make a
> >> Parquet file smaller than the Arrow IPC file for the same data.
> >> However, some Arrow data types currently cannot be converted to a
> >> corresponding Parquet type in the arrow-cpp implementation; you can
> >> check the Arrow documentation for details.
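> >>
> >> As a rough example of the statistics-based pruning (the file name and
> >> filter are invented):
> >>
> >>     import pyarrow.parquet as pq
> >>
> >>     # Row groups whose min/max statistics cannot satisfy the filter
> >>     # are skipped, saving both IO and CPU.
> >>     recent = pq.read_table("events.parquet",
> >>                            filters=[("year", ">=", 2023)])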
> >>
> >> Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:
> >>
> >>> Also, there is https://github.com/lancedb/lance, which sits between
> >>> the two formats. Depending on the use case it can be a great choice.
> >>>
> >>> Best regards
> >>> Adam Lippai
> >>>
> >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
> >>>
> >>>> One benefit of the Feather format (i.e. the Arrow IPC file format)
> >>>> is the ability to mmap the file to easily handle reading sections
> >>>> of a larger-than-memory file of data. Since, as Felipe mentioned,
> >>>> the format is focused on the in-memory representation, you can
> >>>> easily and simply mmap the file and use the raw bytes directly. For
> >>>> a large file that you only want to read sections of, this can be
> >>>> beneficial for IO and memory usage.
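> >>>>
> >>>> A minimal sketch of that pattern in pyarrow (the file name is just
> >>>> an example):
> >>>>
> >>>>     import pyarrow as pa
> >>>>
> >>>>     # Record batches reference the mapped bytes directly, so only
> >>>>     # the pages you actually touch are read from disk.
> >>>>     with pa.memory_map("big.arrow") as source:
> >>>>         reader = pa.ipc.open_file(source)
> >>>>         first = reader.get_batch(0)  # pulls in just this batch's pages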
> >>>>
> >>>> Unfortunately, you are correct that it doesn't allow for easy
> >>>> column projection (you're going to read all the columns for a
> >>>> record batch in the file, no matter what). So it's going to be a
> >>>> trade-off based on your needs as to whether it makes sense, or
> >>>> whether you should use a file format like Parquet instead.
> >>>>
> >>>> -Matt
> >>>>
> >>>>
> >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> >>>> <felipe...@gmail.com> wrote:
> >>>>
> >>>>> It’s not the best since the format is really focused on in-memory
> >>>>> representation and direct computation, but you can do it:
> >>>>>
> >>>>> https://arrow.apache.org/docs/python/feather.html
> >>>>>
> >>>>> —
> >>>>> Felipe
> >>>>>
> >>>>> On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com>
> >>>>> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Is it a good idea to use Apache Arrow as a file format? Looks like
> >>>>>> projecting columns isn't available by default.
> >>>>>>
> >>>>>> One of the benefits of the Parquet file format is column
> >>>>>> projection, where the IO is limited to just the columns projected.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Nara
> >>>>>>
>
