The fact that they describe Arrow and Feather as distinct formats (they're not!) with different characteristics is a bit of a bummer.


On 18/10/2023 at 22:20, Andrew Lamb wrote:
If you are looking for a more formal discussion and empirical analysis of
the differences, I suggest reading "A Deep Dive into Common Open Formats
for Analytical DBMSs" [1], a VLDB 2023 paper (best paper runner-up!) that
compares and contrasts the Arrow, Parquet, ORC and Feather file formats.

[1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:

To further what others have already mentioned, the IPC file format is
primarily optimised for IPC use-cases, that is, exchanging the entire
contents between processes. It is relatively inexpensive to encode and
decode, and supports all Arrow data types, making it ideal for things
like spill-to-disk processing, distributed shuffles, etc.

Parquet by comparison is a storage format, optimised for space
efficiency and selective querying, with [1] containing an overview of
the various techniques the format affords. It is comparatively expensive
to encode and decode, and instead relies on index structures and
statistics to accelerate access.

Both are therefore perfectly viable options depending on your particular
use-case.

[1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/

On 18/10/2023 13:59, Dewey Dunnington wrote:
Plenty of opinions here already, but I happen to think that IPC
streams and/or Arrow File/Feather are wildly underutilized. For the
use-case where you're mostly just going to read an entire file into R
or Python it's a bit faster (and far superior to a CSV or pickling or
.rds files in R).

"you're going to read all the columns for a record batch in the file, no
matter what"

The metadata for every column in every record batch has to be
read, but there's nothing inherent about the format that prevents
selectively loading into memory only the required buffers. (I don't
know off the top of my head if any reader implementation actually does
this).

On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
The Arrow IPC file format is great; it focuses on in-memory representation
and direct computation. It supports compression and dictionary encoding,
and a file can be deserialized zero-copy into the in-memory Arrow format.

Parquet provides some strong functionality, such as statistics, which help
prune unnecessary data during scans and avoid CPU and I/O cost. It also has
highly efficient encodings, which can make a Parquet file smaller than an
Arrow IPC file holding the same data. However, some Arrow data types
currently cannot be converted to a corresponding Parquet type in the
arrow-cpp implementation. See the Arrow documentation for details.

Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:

There is also Lance (https://github.com/lancedb/lance), which sits between
the two formats. Depending on the use case it can be a great choice.

Best regards
Adam Lippai

On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:

One benefit of the Feather format (i.e. the Arrow IPC file format) is the
ability to mmap the file, which makes it easy to read sections of a
larger-than-memory file. Since, as Felipe mentioned, the format is focused
on in-memory representation, you can simply mmap the file and use the raw
bytes directly. For a large file that you only want to read sections of,
this can be beneficial for IO and memory usage.

Unfortunately, you are correct that it doesn't allow for easy column
projection (you're going to read all the columns for a record batch in the
file, no matter what). So it's going to be a trade-off based on your needs
as to whether it makes sense, or if you should use a file format like
Parquet instead.

-Matt


On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:

It’s not the best since the format is really focused on in-memory
representation and direct computation, but you can do it:

https://arrow.apache.org/docs/python/feather.html

—
Felipe

On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
Hi,

Is it a good idea to use Apache Arrow as a file format? Looks like
projecting columns isn't available by default.

One of the benefits of the Parquet file format is column projection, where
the IO is limited to just the columns projected.

Regards,
Nara


