> Of course, what I'm really asking for is to see how Lance would compare ;-)
> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).

Disclaimer: I work for LanceDB and am in no way an unbiased party. However, since you asked:

TL;DR: Lance performs 10-20x better than ORC or Parquet when retrieving a small, scattered selection of rows from a large dataset.

I went ahead and reproduced the experiment in the second paper using Lance. Specifically, a vector search for 10 elements against the first 100 million rows of the LAION-5B dataset. There were a few details missing from the paper (specifically, which index they used) and I ended up training a rather underwhelming index, but the performance of the index is unrelated to the file format and so irrelevant for this discussion anyway.

A vector search performs a CPU-intensive search against a relatively small index (in the paper this index was kept in memory, so I did the same in my experiment). This identifies the rows of interest. We then need to perform a take operation to select those rows from storage. This is the part where the file format matters. So all we are really measuring here is how long it takes to select N rows at random from a dataset. This is one of the use cases Lance was designed for, so it is no surprise that it performs better.

Note that Lance stores data uncompressed. However, that probably doesn't matter in this case. 100 million rows of LAION-5B require ~320GB. Only 20GB of this is metadata; the remaining 300GB is text & image embeddings, which are, by design, not very compressible. The entire Lance dataset required 330GB.

# Results:

The chart in the paper is quite small and uses a log scale, so I had to infer the performance numbers for Parquet & ORC as best I could. The numbers for Lance are accurate, as that is what I measured.
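Stripped of the vector-index machinery, the benchmark reduces to timing a take of a few scattered row ids, averaged over many queries. Here is a minimal sketch of that measurement loop, with the format-specific take stubbed out so it runs standalone; the `lance.dataset(uri).take(row_ids)` call named in the comment is an assumption about the Lance Python API, not something verified against a particular release:

```python
import random
import statistics
import time

# Parameters matching the experiment described above.
NUM_QUERIES = 64         # randomized queries averaged per configuration
ROWS_PER_QUERY = 10      # a vector search returns 10 neighbours
DATASET_ROWS = 100_000_000

def take_rows(row_ids):
    """Stand-in for the format-specific take.

    With Lance this would be roughly lance.dataset(uri).take(row_ids)
    (hypothetical call, for illustration); for Parquet or ORC it means
    locating and decoding every row group (or stripe) that holds one of
    the requested rows. Stubbed here so the sketch needs no storage.
    """
    return [{"row_id": i} for i in row_ids]

def bench_take():
    latencies_ms = []
    for _ in range(NUM_QUERIES):
        # The index search yields a handful of scattered row ids; model
        # that with a uniform random sample over the whole dataset.
        row_ids = random.sample(range(DATASET_ROWS), ROWS_PER_QUERY)
        start = time.perf_counter()
        rows = take_rows(row_ids)
        elapsed_ms = (time.perf_counter() - start) * 1000
        assert len(rows) == ROWS_PER_QUERY
        latencies_ms.append(elapsed_ms)
    return statistics.mean(latencies_ms)

print(f"mean take latency: {bench_take():.3f} ms")
```

Swapping the stub for a real reader turns this into the S3/SSD comparison below; for cold-cache numbers the kernel's disk cache would also need to be dropped between queries.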
These results are averaged over 64 randomized queries (each iteration ran a single query returning 10 results) with the kernel's disk cache cleared (same as the paper, I believe).

## S3:

Parquet: ~12,000ms
Orc: ~80,000ms
Lance: 1,696ms

## Local Storage (SSD):

Parquet: ~800ms
Orc: ~65ms (10ms spent searching index)
Lance: 61ms (59ms spent searching index)

At first glance it may seem like Lance performs about the same as Orc on an SSD. However, this is likely because my index was suboptimal (I did not spend any real time tuning it, since I could just look at the I/O times directly). The Lance format spent only 2ms on I/O, compared with ~55ms spent on I/O by Orc.

# Boring Details:

Index: IVF/PQ with 1,000 IVF partitions (probably should have been 10k partitions, but I'm not patient enough) and 96 PQ subvectors (1 byte per subvector)

Hardware: Tests were performed on an r6id.2xlarge (using the attached NVMe for the SSD tests) in the same region as the S3 storage

Minor detail: The embeddings provided with the LAION-5B dataset (CLIP ViT-L/14) are float16. Lance doesn't yet support float16, so I inflated them to float32 (that doubles the amount of data retrieved, so, if anything, it makes things harder on Lance)

On Thu, Oct 19, 2023 at 9:55 AM Aldrin <octalene....@pm.me.invalid> wrote:

> And the first paper's reference of Arrow (in the references section) lists
> 2022 as the date of last access.
>
> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>
> On Thu, Oct 19, 2023 at 18:51, Aldrin <octalene....@pm.me.INVALID> wrote:
>
> For context, that second referenced paper has Wes McKinney as a co-author,
> so they were much better positioned to say "the right things."
> On Thu, Oct 19, 2023 at 18:38, Jin Shang <shangjin1...@gmail.com> wrote:
>
> Honestly I don't understand why this VLDB paper [1] chooses to include
> Feather in their evaluations. This paper studies OLAP DBMS file formats.
> Feather is clearly not optimized for the workload and performs badly in
> most of their benchmarks. This paper also has several inaccurate or
> outdated claims about Arrow, e.g. that Arrow has no run-length encoding,
> that Arrow's dictionary encoding only supports string types (Tables 3 and 5),
> that Feather is Arrow plus dictionary encoding and compression (Section 3.2),
> etc. Moreover, the two optimizations it proposes for Arrow (in Sections 8.1.1
> and 8.1.3) are actually just two new APIs for working with Arrow data that
> require no change to the Arrow format itself. I fear that this paper may
> actually discourage DB people from using Arrow as their *in-memory* format,
> even though it's the Arrow *file* format that performs badly for their workload.
>
> There is another paper, "An Empirical Evaluation of Columnar Storage
> Formats" [2], covering essentially the same topic. It, however, chooses not
> to evaluate Arrow (in Section 2) because "Arrow is not meant for long-term
> disk storage", citing Wes McKinney's blog post [3] from six years ago. (Wes
> is also a co-author of this paper.) Interestingly, the VLDB paper [1] also
> discusses the two blog posts [3][4] (in Section 9) and states that
> "several limitations (of Arrow) described in [4] persist to this day",
> which IMHO is not true.
>
> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).
> [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> [2] https://arxiv.org/abs/2304.05028
> [3] https://wesmckinney.com/blog/arrow-columnar-abadi/
> [4] https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
>
> On Thu, Oct 19, 2023 at 7:38 PM Roman Shaposhnik <ro...@shaposhnik.org> wrote:
>
> > On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > If you are looking for a more formal discussion and empirical analysis of
> > > the differences, I suggest reading "A Deep Dive into Common Open Formats
> > > for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!)
> > > that compares and contrasts the Arrow, Parquet, ORC and Feather file formats.
> > >
> > > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> >
> > This is a very useful article, but it seems to be taking a DBMS angle.
> > I'm wondering if anyone has seen similar research but with more of an
> > ML/DL angle taken.
> >
> > Of course, what I'm really asking for is to see how Lance would compare ;-)
> >
> > Thanks,
> > Roman.
> >
> > > On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > > > To further what others have already mentioned, the IPC file format is
> > > > primarily optimised for IPC use-cases, that is, exchanging the entire
> > > > contents between processes. It is relatively inexpensive to encode and
> > > > decode, and supports all Arrow datatypes, making it ideal for things
> > > > like spill-to-disk processing, distributed shuffles, etc.
> > > >
> > > > Parquet, by comparison, is a storage format optimised for space
> > > > efficiency and selective querying, with [1] containing an overview of
> > > > the various techniques the format affords. It is comparatively expensive
> > > > to encode and decode, and instead relies on index structures and
> > > > statistics to accelerate access.
> > > > Both are therefore perfectly viable options depending on your
> > > > particular use-case.
> > > >
> > > > [1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > >
> > > > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > > > Plenty of opinions here already, but I happen to think that IPC
> > > > > streams and/or Arrow File/Feather are wildly underutilized. For the
> > > > > use-case where you're mostly just going to read an entire file into R
> > > > > or Python it's a bit faster (and far superior to a CSV or pickling or
> > > > > .rds files in R).
> > > > >
> > > > >> you're going to read all the columns for a record batch in the
> > > > >> file, no matter what
> > > > >
> > > > > The metadata for every column in every record batch has to be
> > > > > read, but there's nothing inherent about the format that prevents
> > > > > selectively loading into memory only the required buffers. (I don't
> > > > > know off the top of my head if any reader implementation actually
> > > > > does this.)
> > > > >
> > > > > On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
> > > > >> The Arrow IPC file format is great; it focuses on in-memory
> > > > >> representation and direct computation. It supports compression and
> > > > >> dictionary encoding, and the file can be zero-copy deserialized to
> > > > >> the in-memory Arrow format.
> > > > >>
> > > > >> Parquet provides some strong functionality, like statistics, which
> > > > >> can help prune unnecessary data during scanning and avoid CPU and
> > > > >> I/O cost. And it has highly efficient encodings, which can make a
> > > > >> Parquet file smaller than the Arrow IPC file for the same data.
> > > > >> However, currently some Arrow data types cannot be converted to the
> > > > >> corresponding Parquet type in the current arrow-cpp implementation.
> > > > >> You can go to the Arrow documentation to take a look.
> > > > >>
> > > > >> Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:
> > > > >>
> > > > >>> Also there is https://github.com/lancedb/lance between the two
> > > > >>> formats. Depending on the use case it can be a great choice.
> > > > >>>
> > > > >>> Best regards
> > > > >>> Adam Lippai
> > > > >>>
> > > > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
> > > > >>>
> > > > >>>> One benefit of the Feather format (i.e. the Arrow IPC file format)
> > > > >>>> is the ability to mmap the file to easily handle reading sections
> > > > >>>> of a larger-than-memory file of data. Since, as Felipe mentioned,
> > > > >>>> the format is focused on in-memory representation, you can easily
> > > > >>>> and simply mmap the file and use the raw bytes directly. For a
> > > > >>>> large file that you only want to read sections of, this can be
> > > > >>>> beneficial for I/O and memory usage.
> > > > >>>>
> > > > >>>> Unfortunately, you are correct that it doesn't allow for easy
> > > > >>>> column projection (you're going to read all the columns for a
> > > > >>>> record batch in the file, no matter what). So it's going to be a
> > > > >>>> trade-off based on your needs as to whether it makes sense, or
> > > > >>>> whether you should use a file format like Parquet instead.
> > > > >>>>
> > > > >>>> -Matt
> > > > >>>>
> > > > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > > > >>>> <felipe...@gmail.com> wrote:
> > > > >>>>
> > > > >>>>> It’s not the best, since the format is really focused on in-memory
> > > > >>>>> representation and direct computation, but you can do it:
> > > > >>>>>
> > > > >>>>> https://arrow.apache.org/docs/python/feather.html
> > > > >>>>>
> > > > >>>>> —
> > > > >>>>> Felipe
> > > > >>>>>
> > > > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>>> Hi,
> > > > >>>>>>
> > > > >>>>>> Is it a good idea to use Apache Arrow as a file format? It looks
> > > > >>>>>> like projecting columns isn't available by default.
> > > > >>>>>>
> > > > >>>>>> One of the benefits of the Parquet file format is column
> > > > >>>>>> projection, where the IO is limited to just the columns projected.
> > > > >>>>>>
> > > > >>>>>> Regards,
> > > > >>>>>> Nara