For context, that second referenced paper has Wes McKinney as a co-author, so they were much better positioned to say "the right things."

On Thu, Oct 19, 2023 at 18:38, Jin Shang <shangjin1...@gmail.com> wrote:
Honestly I don't understand why this VLDB paper [1] chooses to include
Feather in its evaluations. The paper studies OLAP DBMS file formats;
Feather is clearly not optimized for that workload and performs badly in
most of their benchmarks. The paper also makes several inaccurate or
outdated claims about Arrow, e.g. that Arrow has no run-length encoding,
that Arrow's dictionary encoding only supports string types (Tables 3 and
5), and that Feather is Arrow plus dictionary encoding and compression
(Section 3.2). Moreover, the two optimizations it proposes for Arrow (in
Sections 8.1.1 and 8.1.3) are really just two new APIs for working with
Arrow data that require no change to the Arrow format itself. I fear that
this paper may discourage DB people from using Arrow as their *in-memory*
format, even though it is the Arrow *file* format that performs badly for
their workload.

There is another paper, "An Empirical Evaluation of Columnar Storage
Formats" [2], covering essentially the same topic. It, however, chooses
not to evaluate Arrow (in Section 2) because "Arrow is not meant for
long-term disk storage", citing Wes McKinney's blog post [3] from six
years ago. (Wes is also a co-author of that paper.) Interestingly, the
VLDB paper [1] also discusses the two blog posts [3][4] (in Section 9)
and states that "several limitations (of Arrow) described in [4] persist
to this day", which IMHO is not true.

P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
and GPU performance (in Section 5.9). It also cites Lance as one of the
future formats (in Section 5.6.2).

[1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
[2] https://arxiv.org/abs/2304.05028
[3] https://wesmckinney.com/blog/arrow-columnar-abadi/
[4] https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html

On Thu, Oct 19, 2023 at 7:38 PM Roman Shaposhnik <ro...@shaposhnik.org>
wrote:

> On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > If you are looking for a more formal discussion and empirical analysis
> > of the differences, I suggest reading "A Deep Dive into Common Open
> > Formats for Analytical DBMSs" [1], a VLDB 2023 paper (best-paper
> > runner-up!) that compares and contrasts the Arrow, Parquet, ORC and
> > Feather file formats.
> >
> > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
>
> This is a very useful article, but it takes a DBMS angle. I'm wondering
> if anyone has seen similar research with more of an ML/DL angle.
>
> Of course, what I'm really asking for is to see how Lance would compare ;-)
>
> Thanks,
> Roman.
>
> >
> > On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > To further what others have already mentioned, the IPC file format is
> > > primarily optimised for IPC use-cases, that is, exchanging the entire
> > > contents between processes. It is relatively inexpensive to encode and
> > > decode, and supports all Arrow datatypes, making it ideal for things
> > > like spill-to-disk processing, distributed shuffles, etc.
> > >
> > > Parquet by comparison is a storage format, optimised for space
> > > efficiency and selective querying, with [1] containing an overview of
> > > the various techniques the format affords. It is comparatively
> > > expensive to encode and decode, and instead relies on index structures
> > > and statistics to accelerate access.
> > >
> > > Both are therefore perfectly viable options depending on your
> > > particular use-case.
> > >
> > > [1]:
> > > https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
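> > >
> > > For a concrete feel, here is a minimal pyarrow sketch of the
> > > spill/shuffle pattern (file name illustrative, untested):
> > >
> > >     import pyarrow as pa
> > >
> > >     batch = pa.record_batch(
> > >         [pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])],
> > >         names=["id", "val"],
> > >     )
> > >
> > >     # Encoding is cheap: the Arrow buffers are written out
> > >     # essentially as-is, prefixed with their metadata.
> > >     with pa.OSFile("/tmp/shuffle.arrows", "wb") as sink:
> > >         with pa.ipc.new_stream(sink, batch.schema) as writer:
> > >             writer.write_batch(batch)
> > >
> > >     # Decoding is equally cheap; each batch arrives ready for
> > >     # direct computation.
> > >     with pa.OSFile("/tmp/shuffle.arrows", "rb") as source:
> > >         reader = pa.ipc.open_stream(source)
> > >         for b in reader:
> > >             print(b.num_rows)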
> > >
> > > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > > Plenty of opinions here already, but I happen to think that IPC
> > > > streams and/or Arrow File/Feather are wildly underutilized. For the
> > > > use-case where you're mostly just going to read an entire file into
> > > > R or Python it's a bit faster (and far superior to CSV, pickling, or
> > > > .rds files in R).
> > > >
> > > >> you're going to read all the columns for a record batch in the
> > > >> file, no matter what
> > > > The metadata for every column in every record batch has to be read,
> > > > but there's nothing inherent about the format that prevents
> > > > selectively loading into memory only the required buffers. (I don't
> > > > know off the top of my head if any reader implementation actually
> > > > does this.)
> > > >
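> > > > For example, something like this with pyarrow (file name
> > > > illustrative; I haven't benchmarked this exact snippet):
> > > >
> > > >     import pyarrow.feather as feather
> > > >
> > > >     # Read the whole file into an Arrow Table; with memory_map=True
> > > >     # the buffers can be mapped rather than copied.
> > > >     table = feather.read_table("data.arrow", memory_map=True)
> > > >
> > > >     # A column subset is also accepted, though (per the caveat
> > > >     # above) the record-batch metadata is still parsed in full.
> > > >     subset = feather.read_table("data.arrow", columns=["a", "b"])
> > > >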
> > > > On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com>
> > > > wrote:
> > > >> The Arrow IPC file is great; it focuses on in-memory representation
> > > >> and direct computation. It supports compression and dictionary
> > > >> encoding, and, when not compressed, the file can be deserialized to
> > > >> the in-memory Arrow format with zero copies.
> > > >>
> > > >> Parquet provides some strong functionality, like statistics, which
> > > >> can help prune unnecessary data during scanning and avoid CPU and
> > > >> IO cost. It also has highly efficient encodings, which can make a
> > > >> Parquet file smaller than the Arrow IPC file for the same data.
> > > >> However, some Arrow data types currently cannot be converted to a
> > > >> corresponding Parquet type in the arrow-cpp implementation; see the
> > > >> Arrow documentation for details.
> > > >>
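> > > >> For illustration, a rough pyarrow sketch of the trade-off (paths,
> > > >> data, and the filter are made up):
> > > >>
> > > >>     import pyarrow as pa
> > > >>     import pyarrow.feather as feather
> > > >>     import pyarrow.parquet as pq
> > > >>
> > > >>     table = pa.table({"x": list(range(1000)), "tag": ["a", "b"] * 500})
> > > >>
> > > >>     # IPC/Feather: optional buffer compression, cheap to decode.
> > > >>     feather.write_feather(table, "data.arrow", compression="zstd")
> > > >>
> > > >>     # Parquet: heavier encodings plus per-row-group statistics...
> > > >>     pq.write_table(table, "data.parquet", compression="zstd")
> > > >>
> > > >>     # ...which let a reader prune row groups with a filter.
> > > >>     filtered = pq.read_table("data.parquet", filters=[("x", ">", 900)])
> > > >>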
> > > >> On Wed, Oct 18, 2023 at 10:50 Adam Lippai <a...@rigo.sk> wrote:
> > > >>
> > > >>> Also, there is https://github.com/lancedb/lance, which sits
> > > >>> between the two formats. Depending on the use case it can be a
> > > >>> great choice.
> > > >>>
> > > >>> Best regards
> > > >>> Adam Lippai
> > > >>>
> > > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> One benefit of the feather format (i.e. Arrow IPC file format) is
> > > >>>> the ability to mmap the file to easily handle reading sections of
> > > >>>> a larger-than-memory file of data. Since, as Felipe mentioned, the
> > > >>>> format is focused on in-memory representation, you can easily and
> > > >>>> simply mmap the file and use the raw bytes directly. For a large
> > > >>>> file that you only want to read sections of, this can be
> > > >>>> beneficial for IO and memory usage.
> > > >>>>
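> > > >>>> e.g., roughly, with pyarrow (file name illustrative):
> > > >>>>
> > > >>>>     import pyarrow as pa
> > > >>>>
> > > >>>>     # Map the file instead of reading it; the OS pages bytes in
> > > >>>>     # on demand.
> > > >>>>     with pa.memory_map("big.arrow") as source:
> > > >>>>         reader = pa.ipc.open_file(source)
> > > >>>>         batch = reader.get_batch(0)  # wraps mapped bytes, no copy
> > > >>>>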
> > > >>>> Unfortunately, you are correct that it doesn't allow for easy
> > > >>>> column projection (you're going to read all the columns for a
> > > >>>> record batch in the file, no matter what). So it's going to be a
> > > >>>> trade-off based on your needs as to whether it makes sense, or if
> > > >>>> you should use a file format like Parquet instead.
> > > >>>>
> > > >>>> -Matt
> > > >>>>
> > > >>>>
> > > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> > > >>>> felipe...@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> It’s not the best since the format is really focused on
> > > >>>>> in-memory representation and direct computation, but you can do
> > > >>>>> it:
> > > >>>>>
> > > >>>>> https://arrow.apache.org/docs/python/feather.html
> > > >>>>>
> > > >>>>> —
> > > >>>>> Felipe
> > > >>>>>
> > > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara
> > > >>>>> <narayanan.arunacha...@gmail.com> wrote:
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> Is it a good idea to use Apache Arrow as a file format? It
> > > >>>>>> looks like projecting columns isn't available by default.
> > > >>>>>>
> > > >>>>>> One of the benefits of the Parquet file format is column
> > > >>>>>> projection, where the IO is limited to just the columns
> > > >>>>>> projected.
> > > >>>>>>
> > > >>>>>> Regards,
> > > >>>>>> Nara
> > > >>>>>>
> > >
>
