Re: Apache Arrow file format

Jacek Pliszka Thu, 19 Oct 2023 00:02:54 -0700

There is a note there explaining what they mean by it, but
further on they do not maintain that distinction.


The fact that Parquet can be a better in-memory format than Arrow for
certain common uses is something I hadn't thought of, and it
is eye-opening for me, admittedly because I am not up to date
on the topic.

On the other hand, it shows which Parquet features could be useful for
Arrow to match that performance.

BR,

Jacek

On Thu, 19 Oct 2023 at 00:17, Antoine Pitrou <[email protected]> wrote:
>
>
> The fact that they describe Arrow and Feather as distinct formats
> (they're not!) with different characteristics is a bit of a bummer.
>
>
> On 18/10/2023 at 22:20, Andrew Lamb wrote:
> > If you are looking for a more formal discussion and empirical analysis of
> > the differences, I suggest reading "A Deep Dive into Common Open Formats
> > for Analytical DBMSs" [1], a VLDB 2023 paper (best paper runner-up!) that
> > compares and contrasts the Arrow, Parquet, ORC and Feather file formats.
> >
> > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> >
> > On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> > <[email protected]> wrote:
> >
> >> To further what others have already mentioned, the IPC file format is
> >> primarily optimised for IPC use-cases, that is exchanging the entire
> >> contents between processes. It is relatively inexpensive to encode and
> >> decode, and supports all arrow datatypes, making it ideal for things
> >> like spill-to-disk processing, distributed shuffles, etc...
> >>
> >> Parquet by comparison is a storage format, optimised for space
> >> efficiency and selective querying, with [1] containing an overview of
> >> the various techniques the format affords. It is comparatively expensive
> >> to encode and decode, and instead relies on index structures and
> >> statistics to accelerate access.
> >>
> >> Both are therefore perfectly viable options depending on your particular
> >> use-case.
> >>
> >> [1]:
> >>
> >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >>
> >> On 18/10/2023 13:59, Dewey Dunnington wrote:
> >>> Plenty of opinions here already, but I happen to think that IPC
> >>> streams and/or Arrow File/Feather are wildly underutilized. For the
> >>> use-case where you're mostly just going to read an entire file into R
> >>> or Python it's a bit faster (and far superior to a CSV or pickling or
> >>> .rds files in R).
> >>>
> >>>> you're going to read all the columns for a record batch in the file, no
> >>>> matter what
> >>> The metadata for every column in every record batch has to be
> >>> read, but there's nothing inherent about the format that prevents
> >>> selectively loading into memory only the required buffers. (I don't
> >>> know off the top of my head if any reader implementation actually does
> >>> this).
> >>>
> >>> On Wed, Oct 18, 2023 at 12:02 AM wish maple <[email protected]> wrote:
> >>>> The Arrow IPC file format is great; it focuses on in-memory
> >>>> representation and direct computation. It supports compression and
> >>>> dictionary encoding, and the file can be deserialized zero-copy into
> >>>> the in-memory Arrow format.
> >>>>
> >>>> Parquet provides some strong functionality, such as statistics, which
> >>>> can help prune unnecessary data during scanning and avoid CPU and I/O
> >>>> cost. It also has highly efficient encodings, which can make a Parquet
> >>>> file smaller than an Arrow IPC file holding the same data. However,
> >>>> some Arrow data types currently cannot be converted to a corresponding
> >>>> Parquet type in the arrow-cpp implementation. You can consult the Arrow
> >>>> documentation for details.
> >>>>
> >>>> On Wed, Oct 18, 2023 at 10:50, Adam Lippai <[email protected]> wrote:
> >>>>
> >>>>> There is also https://github.com/lancedb/lance, which sits between the
> >>>>> two formats. Depending on the use case, it can be a great choice.
> >>>>>
> >>>>> Best regards
> >>>>> Adam Lippai
> >>>>>
> >>>>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <[email protected]> wrote:
> >>>>>
> >>>>>> One benefit of the Feather format (i.e. the Arrow IPC file format) is
> >>>>>> the ability to mmap the file to easily handle reading sections of a
> >>>>>> larger-than-memory file of data. Since, as Felipe mentioned, the format
> >>>>>> is focused on in-memory representation, you can simply mmap the file
> >>>>>> and use the raw bytes directly. For a large file that you only want to
> >>>>>> read sections of, this can be beneficial for IO and memory usage.
> >>>>>>
> >>>>>> Unfortunately, you are correct that it doesn't allow for easy column
> >>>>>> projection (you're going to read all the columns for a record batch in
> >>>>>> the file, no matter what). So it's going to be a trade-off based on
> >>>>>> your needs as to whether it makes sense, or if you should use a file
> >>>>>> format like Parquet instead.
> >>>>>>
> >>>>>> -Matt
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <[email protected]> wrote:
> >>>>>>
> >>>>>>> It’s not the best since the format is really focused on in-memory
> >>>>>>> representation and direct computation, but you can do it:
> >>>>>>>
> >>>>>>> https://arrow.apache.org/docs/python/feather.html
> >>>>>>>
> >>>>>>> —
> >>>>>>> Felipe
> >>>>>>>
> >>>>>>> On Tue, 17 Oct 2023 at 23:26 Nara <[email protected]> wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Is it a good idea to use Apache Arrow as a file format? Looks like
> >>>>>>>> projecting columns isn't available by default.
> >>>>>>>>
> >>>>>>>> One of the benefits of the Parquet file format is column projection,
> >>>>>>>> where the IO is limited to just the columns projected.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Nara
> >>>>>>>>
> >>
> >
