> Looks like projecting columns isn't available by default.

> One of the benefits of Parquet file format is column projection, where
> the IO is limited to just the columns projected.

> Unfortunately, you are correct that it doesn't allow for easy column
> projecting (you're going to read all the columns for a record batch in
> the file, no matter what)

Selective column reading is possible for Arrow IPC files. The issue
tracking this problem in the Arrow C++ implementation is [1], and it was
addressed in Arrow 7.0 by PR [2]. So if you are using a recent version of
the Arrow C++ `RecordBatchFileReader` to read an IPC file, you can use
`IpcReadOptions` with its `std::vector<int> included_fields` option to
achieve this. IPC file readers in other languages' implementations may
not support this capability yet, though.
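
For example, a minimal sketch against the Arrow C++ API (the file name
and column indices here are placeholders, and error handling is elided):

  #include <arrow/api.h>
  #include <arrow/io/file.h>
  #include <arrow/ipc/reader.h>

  arrow::Status ReadProjectedColumns() {
    ARROW_ASSIGN_OR_RAISE(auto file,
        arrow::io::ReadableFile::Open("data.arrow"));
    auto options = arrow::ipc::IpcReadOptions::Defaults();
    options.included_fields = {0, 2};  // only fetch columns 0 and 2
    ARROW_ASSIGN_OR_RAISE(auto reader,
        arrow::ipc::RecordBatchFileReader::Open(file, options));
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
    // batch now contains only the projected columns.
    return arrow::Status::OK();
  }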

If an IPC file contains many record batches, reading columns selectively
from all of these batches may still not perform well. Currently each
record batch in an IPC file stores its schema-related metadata at the end
of the batch, which requires a random access to fetch. So if you are
reading specific columns from multiple batches in an IPC file, your
application will issue many sequential IOs (to fetch the columnar data)
interleaved with many random-access IOs (to fetch the fields' metadata,
such as their offsets). Depending on your storage, these random accesses
may or may not slow down the read. The C++ implementation has some APIs,
like `SelectiveIpcFileRecordBatchGenerator`, which support pre-buffering
the metadata and may help performance a bit.
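
Continuing the sketch above, scanning the selected columns across every
batch in the file would look like this; each call fetches that batch's
metadata plus only the projected column buffers:

  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    // process the projected columns of this batch...
  }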

[1] https://github.com/apache/arrow/issues/28430,
https://issues.apache.org/jira/browse/ARROW-12683
[2] https://github.com/apache/arrow/pull/11486

Regards,
Yue

On Sun, Oct 22, 2023 at 5:40 PM wish maple <maplewish...@gmail.com> wrote:

> IMO, Facebook has mentioned a format for ML in section 3.3.2 of its
> paper [1]. It mentions that:
>
> > ML tables are also typically much wider, and tend to have tens of
> > thousands of features usually stored as large maps.
> >
> > The most pressing issue with the DWRF format was metadata overhead; our
> > ML use cases needed a very large number of features (typically stored
> > as giant maps), and the DWRF map format, albeit optimized, had too much
> > metadata overhead. Apart from this, DWRF had several other limitations
> > related to encodings and stripe structure, which were very difficult to
> > fix in a backward-compatible way.
>
> So, I think Lance may bring better performance on these specific
> workloads. Furthermore, it has rich and powerful indexes, which make it
> a good choice for ML systems.
>
> However, I think Parquet is the most universal columnar file format, and
> users can tune a Parquet file by disabling statistics on some columns,
> adjusting the page type, disabling compression, or using LZ4 as
> SingleStore does [2] to accelerate point-get workloads (see the sketch
> below). Furthermore, support for some types, like fp16 in Parquet, is
> ongoing now.
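>
> As a minimal sketch of that kind of tuning with the Parquet C++ writer
> properties (the column name "features" here is hypothetical):
>
>   #include <parquet/properties.h>
>
>   // Hypothetical tuning: skip statistics and dictionary encoding on a
>   // wide feature column, and use LZ4 for cheaper decompression on
>   // point-get lookups.
>   parquet::WriterProperties::Builder builder;
>   builder.compression(parquet::Compression::LZ4)
>       ->disable_statistics("features")
>       ->disable_dictionary("features");
>   std::shared_ptr<parquet::WriterProperties> props = builder.build();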
>
> Besides, as a parquet-rs maintainer mentioned in a blog post [3], the
> implementation details matter as much as the format itself. Features
> like schema evolution, pruning, and IO handling differ across
> implementations, and they are all essential to read performance. I think
> some benchmarks of ORC/Parquet and other columnar formats ignore this
> most essential part, since Parquet/ORC is only a feature-rich format
> specification and users can build many hacks and optimizations on top of
> it.
>
> I think it may be better to support a "columnar format" interface and
> test your workload against each format (a sketch of such an interface is
> below). Nobody knows your system better than you do :-).
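>
> As a purely hypothetical sketch (this ColumnarFormatReader interface is
> made up for illustration, not an existing API):
>
>   #include <arrow/api.h>
>
>   // One adapter per format (Parquet, ORC, IPC, Lance, ...) lets the
>   // same benchmark harness drive every candidate format.
>   class ColumnarFormatReader {
>    public:
>     virtual ~ColumnarFormatReader() = default;
>     // Read only the projected columns into an Arrow table.
>     virtual arrow::Result<std::shared_ptr<arrow::Table>> ReadColumns(
>         const std::vector<int>& column_indices) = 0;
>   };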
>
> Best,
> Xuwei Fu
>
> [1] https://research.facebook.com/publications/shared-foundations-modernizing-metas-data-lakehouse/
> [2] https://dl.acm.org/doi/10.1145/3514221.3526055
> [3] https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> [4] https://blog.getdaft.io/p/working-with-the-apache-parquet-file
>
> On Sun, Oct 22, 2023 at 04:51, Weston Pace <weston.p...@gmail.com> wrote:
>
> > > Of course, what I'm really asking for is to see how Lance would
> > > compare ;-)
> >
> > > P.S. The second paper [2] also talks about ML workloads (in Section
> > > 5.8) and GPU performance (in Section 5.9). It also cites Lance as one
> > > of the future formats (in Section 5.6.2).
> >
> > Disclaimer: I work for LanceDb and am in no way an unbiased party.
> > However, since you asked:
> >
> > TL;DR: Lance performs 10-20x better than ORC or Parquet when retrieving
> > a small scattered selection of rows from a large dataset.
> >
> > I went ahead and reproduced the experiment in the second paper using
> > Lance.  Specifically, a vector search for 10 elements against the first
> > 100 million rows of the laion 5b dataset.  There were a few details
> > missing in the paper (specifically around what index they used) and I
> > ended up training a rather underwhelming index but the performance of
> > the index is unrelated to the file format and so irrelevant for this
> > discussion anyways.
> >
> > Vector searches perform a CPU intensive index search against a
> > relatively small index (in the paper this index was kept in memory and
> > so I did the same for my experiment).  This identifies the rows of
> > interest.  We then need to perform a take operation to select those
> > rows from storage.  This is the part where the file format matters. So
> > all we are really measuring here is how long it takes to select N rows
> > at random from a dataset.  This is one of the use cases Lance was
> > designed for and so it is no surprise that it performs better.
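> >
> > For reference, that take step can be expressed with the Arrow C++
> > dataset API roughly like this (a sketch, not the code I benchmarked;
> > error handling elided):
> >
> >   #include <arrow/api.h>
> >   #include <arrow/dataset/api.h>
> >
> >   // Select rows at given indices from an on-disk dataset; this is the
> >   // operation whose latency the file format dominates.
> >   arrow::Result<std::shared_ptr<arrow::Table>> TakeRandomRows(
> >       std::shared_ptr<arrow::dataset::Dataset> dataset,
> >       const arrow::Array& row_indices) {
> >     ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
> >     ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
> >     return scanner->TakeRows(row_indices);
> >   }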
> >
> > Note that Lance stores data uncompressed.  However, it probably doesn't
> > matter in this case.  100 million rows of Laion 5B requires ~320GB.
> > Only 20GB of this is metadata.  The remaining 300GB is text & image
> > embeddings.  These embeddings are, by design, not very compressible.
> > The entire lance dataset required 330GB.
> >
> > # Results:
> >
> > The chart in the paper is quite small and uses a log scale.  I had to
> > infer the performance numbers for parquet & orc as best I could.  The
> > numbers for lance are accurate as that is what I measured.  These
> > results are averaged from 64 randomized queries (each iteration ran a
> > single query to return 10 results) with the kernel's disk cache cleared
> > (same as the paper I believe).
> >
> > ## S3:
> >
> > Parquet: ~12,000ms
> > Orc: ~80,000ms
> > Lance: 1,696ms
> >
> > ## Local Storage (SSD):
> >
> > Parquet: ~800ms
> > Orc: ~65ms (10ms spent searching index)
> > Lance: 61ms (59ms spent searching index)
> >
> > At first glance it may seem like Lance performs about the same as Orc
> > with an SSD.  However, this is likely because my index was suboptimal
> > (I did not spend any real time tuning it since I could just look at the
> > I/O times directly).  The lance format spent only 2ms on I/O compared
> > with ~55ms spent on I/O by Orc.
> >
> > # Boring Details:
> >
> > Index: IVF/PQ with 1000 IVF partitions (probably should have been 10k
> > partitions but I'm not patient enough) and 96 PQ subvectors (1 byte per
> > subvector)
> > Hardware: Tests were performed on an r6id.2xlarge (using the attached
> > NVME for the SSD tests) in the same region as the S3 storage
> >
> > Minor detail: The embeddings provided with the laion 5b dataset (clip
> > vit-l/14) were provided as float16.  Lance doesn't yet support float16
> > and so I inflated these to float32 (that doubles the amount of data
> > retrieved so, if anything, it's just making things harder on lance)
> >
> > On Thu, Oct 19, 2023 at 9:55 AM Aldrin <octalene....@pm.me.invalid>
> > wrote:
> >
> > > And the first paper's reference of arrow (in the references section)
> > > lists 2022 as the date of last access.
> > >
> > >
> > > On Thu, Oct 19, 2023 at 18:51, Aldrin <octalene....@pm.me.INVALID> wrote:
> > >
> > > For context, that second referenced paper has Wes McKinney as a
> > > co-author, so they were much better positioned to say "the right
> > > things."
> > >
> > >
> > > On Thu, Oct 19, 2023 at 18:38, Jin Shang <shangjin1...@gmail.com> wrote:
> > >
> > > Honestly I don't understand why this VLDB paper [1] chooses to
> > > include Feather in their evaluations. This paper studies OLAP DBMS
> > > file formats. Feather is clearly not optimized for the workload and
> > > performs badly in most of their benchmarks. This paper also has
> > > several inaccurate or outdated claims about Arrow, e.g. Arrow has no
> > > run length encoding, Arrow's dictionary encoding only supports string
> > > types (Table 3 and 5), Feather is Arrow plus dictionary encoding and
> > > compression (Section 3.2), etc. Moreover, the two optimizations it
> > > proposes for Arrow (in Section 8.1.1 and 8.1.3) are actually just two
> > > new APIs for working with Arrow data that require no change to the
> > > Arrow format itself. I fear that this paper may actually discourage
> > > DB people from using Arrow as their *in-memory* format, even though
> > > it's the Arrow *file* format that performs badly for their workload.
> > >
> > > There is another paper "An Empirical Evaluation of Columnar Storage
> > > Formats" [2] covering essentially the same topic. It however chooses
> > > not to evaluate Arrow (in Section 2) because "Arrow is not meant for
> > > long-term disk storage", citing Wes McKinney's blog post [3] from six
> > > years ago. (Wes is also a co-author of this paper.) Interestingly,
> > > the VLDB paper [1] also talks about the two blog posts [3][4] (in
> > > Section 9) and stated that "several limitations (of Arrow) described
> > > in [4] persist to this day", which IMHO is not true.
> > >
> > > P.S. The second paper [2] also talks about ML workloads (in Section
> > > 5.8) and GPU performance (in Section 5.9). It also cites Lance as one
> > > of the future formats (in Section 5.6.2).
> > >
> > > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> > > [2] https://arxiv.org/abs/2304.05028
> > > [3] https://wesmckinney.com/blog/arrow-columnar-abadi/
> > > [4]
> > > https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
> > >
> > > On Thu, Oct 19, 2023 at 7:38 PM Roman Shaposhnik
> > > <ro...@shaposhnik.org> wrote:
> > >
> > > > On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb <al...@influxdata.com>
> > > > wrote:
> > > > >
> > > > > If you are looking for a more formal discussion and empirical
> > > > > analysis of the differences, I suggest reading "A Deep Dive into
> > > > > Common Open Formats for Analytical DBMSs" [1], a VLDB 2023 paper
> > > > > (runner-up best paper!) that compares and contrasts the Arrow,
> > > > > Parquet, ORC and Feather file formats.
> > > > >
> > > > > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> > > >
> > > > This is a very useful article, but it seems to be taking a DBMS
> > > > angle. I'm wondering if anyone has seen similar research but with
> > > > more of an ML/DL angle taken.
> > > >
> > > > Of course, what I'm really asking for is to see how Lance would
> > > > compare ;-)
> > > >
> > > > Thanks,
> > > > Roman.
> > > >
> > > > >
> > > > > On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> > > > > <r.taylordav...@googlemail.com.invalid> wrote:
> > > > >
> > > > > > To further what others have already mentioned, the IPC file
> > > > > > format is primarily optimised for IPC use-cases, that is,
> > > > > > exchanging the entire contents between processes. It is
> > > > > > relatively inexpensive to encode and decode, and supports all
> > > > > > arrow datatypes, making it ideal for things like spill-to-disk
> > > > > > processing, distributed shuffles, etc...
> > > > > >
> > > > > > Parquet by comparison is a storage format, optimised for space
> > > > > > efficiency and selective querying, with [1] containing an
> > > > > > overview of the various techniques the format affords. It is
> > > > > > comparatively expensive to encode and decode, and instead relies
> > > > > > on index structures and statistics to accelerate access.
> > > > > >
> > > > > > Both are therefore perfectly viable options depending on your
> > > > > > particular use-case.
> > > > > >
> > > > > > [1]:
> > > > > > https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > > > >
> > > > > > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > > > > > Plenty of opinions here already, but I happen to think that
> > > > > > > IPC streams and/or Arrow File/Feather are wildly
> > > > > > > underutilized. For the use-case where you're mostly just going
> > > > > > > to read an entire file into R or Python it's a bit faster (and
> > > > > > > far superior to a CSV or pickling or .rds files in R).
> > > > > > >
> > > > > > >> you're going to read all the columns for a record batch in
> > > > > > >> the file, no matter what
> > > > > > >
> > > > > > > The metadata for every column in every record batch has to be
> > > > > > > read, but there's nothing inherent about the format that
> > > > > > > prevents selectively loading into memory only the required
> > > > > > > buffers. (I don't know off the top of my head if any reader
> > > > > > > implementation actually does this.)
> > > > > > >
> > > > > > > On Wed, Oct 18, 2023 at 12:02 AM wish maple
> > > > > > > <maplewish...@gmail.com> wrote:
> > > > > > >> The Arrow IPC file format is great; it focuses on in-memory
> > > > > > >> representation and direct computation. Basically, it supports
> > > > > > >> compression and dictionary encoding, and can zero-copy
> > > > > > >> deserialize the file to the in-memory Arrow format.
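> > > > > > >>
> > > > > > >> For instance, a rough sketch of enabling buffer compression
> > > > > > >> when writing an IPC file with Arrow C++ (sink and schema are
> > > > > > >> assumed to exist already; error handling elided):
> > > > > > >>
> > > > > > >>   #include <arrow/ipc/writer.h>
> > > > > > >>   #include <arrow/util/compression.h>
> > > > > > >>
> > > > > > >>   // Compress each buffer with ZSTD as batches are written.
> > > > > > >>   auto options = arrow::ipc::IpcWriteOptions::Defaults();
> > > > > > >>   ARROW_ASSIGN_OR_RAISE(options.codec,
> > > > > > >>       arrow::util::Codec::Create(arrow::Compression::ZSTD));
> > > > > > >>   ARROW_ASSIGN_OR_RAISE(auto writer,
> > > > > > >>       arrow::ipc::MakeFileWriter(sink, schema, options));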
> > > > > > >>
> > > > > > >> Parquet provides some strong functionality, like statistics,
> > > > > > >> which can help prune unnecessary data during scanning and
> > > > > > >> avoid CPU and IO cost. And it has highly efficient encodings,
> > > > > > >> which can make a Parquet file smaller than the Arrow IPC file
> > > > > > >> for the same data. However, some Arrow data types currently
> > > > > > >> cannot be converted to corresponding Parquet types in the
> > > > > > >> arrow-cpp implementation. You can go to the Arrow
> > > > > > >> documentation to take a look.
> > > > > > >>
> > > > > > >>> On Wed, Oct 18, 2023 at 10:50, Adam Lippai <a...@rigo.sk> wrote:
> > > > > > >>
> > > > > > >>> Also there is https://github.com/lancedb/lance between the
> > > > > > >>> two formats. Depending on the use case it can be a great
> > > > > > >>> choice.
> > > > > > >>>
> > > > > > >>> Best regards
> > > > > > >>> Adam Lippai
> > > > > > >>>
> > > > > > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <
> > zotthewiz...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> One benefit of the feather format (i.e. Arrow IPC file
> > > > > > >>>> format) is the ability to mmap the file to easily handle
> > > > > > >>>> reading sections of a larger-than-memory file of data.
> > > > > > >>>> Since, as Felipe mentioned, the format is focused on
> > > > > > >>>> in-memory representation, you can easily and simply mmap the
> > > > > > >>>> file and use the raw bytes directly. For a large file that
> > > > > > >>>> you only want to read sections of, this can be beneficial
> > > > > > >>>> for IO and memory usage.
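> > > > > > >>>>
> > > > > > >>>> A rough sketch of that mmap path in Arrow C++ (the file
> > > > > > >>>> name is a placeholder; error handling elided):
> > > > > > >>>>
> > > > > > >>>>   #include <arrow/io/file.h>
> > > > > > >>>>   #include <arrow/ipc/reader.h>
> > > > > > >>>>   #include <arrow/result.h>
> > > > > > >>>>
> > > > > > >>>>   arrow::Status MmapIpcFile() {
> > > > > > >>>>     // Map the file rather than reading it up front; record
> > > > > > >>>>     // batches then reference the mapped memory (zero-copy).
> > > > > > >>>>     ARROW_ASSIGN_OR_RAISE(auto mmapped,
> > > > > > >>>>         arrow::io::MemoryMappedFile::Open(
> > > > > > >>>>             "data.arrow", arrow::io::FileMode::READ));
> > > > > > >>>>     ARROW_ASSIGN_OR_RAISE(auto reader,
> > > > > > >>>>         arrow::ipc::RecordBatchFileReader::Open(mmapped));
> > > > > > >>>>     return arrow::Status::OK();
> > > > > > >>>>   }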
> > > > > > >>>>
> > > > > > >>>> Unfortunately, you are correct that it doesn't allow for
> > > > > > >>>> easy column projecting (you're going to read all the columns
> > > > > > >>>> for a record batch in the file, no matter what). So it's
> > > > > > >>>> going to be a trade off based on your needs as to whether it
> > > > > > >>>> makes sense, or if you should use a file format like Parquet
> > > > > > >>>> instead.
> > > > > > >>>>
> > > > > > >>>> -Matt
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > > > > > >>>> <felipe...@gmail.com> wrote:
> > > > > > >>>>
> > > > > > >>>>> It’s not the best since the format is really focused on
> > > > > > >>>>> in-memory representation and direct computation, but you
> > > > > > >>>>> can do it:
> > > > > > >>>>>
> > > > > > >>>>> https://arrow.apache.org/docs/python/feather.html
> > > > > > >>>>>
> > > > > > >>>>> —
> > > > > > >>>>> Felipe
> > > > > > >>>>>
> > > > > > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara
> > > > > > >>>>> <narayanan.arunacha...@gmail.com> wrote:
> > > > > > >>>>>> Hi,
> > > > > > >>>>>>
> > > > > > >>>>>> Is it a good idea to use Apache Arrow as a file format?
> > > > > > >>>>>> Looks like projecting columns isn't available by default.
> > > > > > >>>>>>
> > > > > > >>>>>> One of the benefits of the Parquet file format is column
> > > > > > >>>>>> projection, where the IO is limited to just the columns
> > > > > > >>>>>> projected.
> > > > > > >>>>>>
> > > > > > >>>>>> Regards,
> > > > > > >>>>>> Nara
> > > > > > >>>>>>
> > > > > >
> > > >
> > >
> > >
> >
>
