On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> If you are looking for a more formal discussion and empirical analysis of
> the differences, I suggest reading "A Deep Dive into Common Open Formats
> for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!)
> that compares and contrasts the Arrow, Parquet, ORC and Feather file
> formats.
>
> [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
This is a very useful article, but it seems to take a DBMS angle. I'm
wondering if anyone has seen similar research with more of an ML/DL angle.
Of course, what I'm really asking for is to see how Lance would compare ;-)

Thanks,
Roman.

> On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > To further what others have already mentioned, the IPC file format is
> > primarily optimised for IPC use-cases, that is, exchanging the entire
> > contents between processes. It is relatively inexpensive to encode and
> > decode, and it supports all Arrow data types, making it ideal for things
> > like spill-to-disk processing, distributed shuffles, etc.
> >
> > Parquet, by comparison, is a storage format optimised for space
> > efficiency and selective querying, with [1] containing an overview of
> > the various techniques the format affords. It is comparatively expensive
> > to encode and decode, and instead relies on index structures and
> > statistics to accelerate access.
> >
> > Both are therefore perfectly viable options depending on your particular
> > use-case.
> >
> > [1]:
> > https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >
> > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > Plenty of opinions here already, but I happen to think that IPC
> > > streams and/or Arrow File/Feather are wildly underutilized. For the
> > > use-case where you're mostly just going to read an entire file into R
> > > or Python, it's a bit faster (and far superior to a CSV, to pickling,
> > > or to .rds files in R).
> > >
> > >> you're going to read all the columns for a record batch in the file,
> > >> no matter what
> > >
> > > The metadata for every column in every record batch has to be read,
> > > but there's nothing inherent about the format that prevents
> > > selectively loading into memory only the required buffers. (I don't
> > > know off the top of my head if any reader implementation actually
> > > does this.)
> > >
> > > On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
> > >> The Arrow IPC file format is great: it focuses on the in-memory
> > >> representation and direct computation. It supports compression and
> > >> dictionary encoding, and a file can be zero-copy deserialized to the
> > >> in-memory Arrow format.
> > >>
> > >> Parquet provides some strong functionality, like statistics, which
> > >> can help prune unnecessary data during scanning and avoid CPU and IO
> > >> cost. It also has highly efficient encodings, which can make a
> > >> Parquet file smaller than the Arrow IPC file for the same data.
> > >> However, some Arrow data types currently cannot be converted to a
> > >> corresponding Parquet type in the arrow-cpp implementation; see the
> > >> Arrow documentation for details.
> > >>
> > >> Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:
> > >>
> > >>> There is also https://github.com/lancedb/lance, which sits between
> > >>> the two formats. Depending on the use case it can be a great choice.
> > >>>
> > >>> Best regards
> > >>> Adam Lippai
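To make the trade-off above concrete, here is a minimal pyarrow sketch (the
file names, table contents, and the zstd choice are mine for illustration,
not from the thread): the same table written as Feather/IPC and as Parquet,
with the Parquet read projecting a single column.

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # A toy table; real workloads would be wider and larger.
    table = pa.table({
        "id": pa.array(range(1_000_000), type=pa.int64()),
        "value": pa.array([i * 0.5 for i in range(1_000_000)], type=pa.float64()),
    })

    # Arrow IPC / Feather V2: cheap to encode and decode, compression optional.
    feather.write_feather(table, "data.feather", compression="zstd")

    # Parquet: heavier encode/decode, but columnar encodings plus per-row-group
    # statistics let readers prune data during scans.
    pq.write_table(table, "data.parquet", compression="zstd")

    # Column projection: IO is limited to the requested column's chunks.
    values_only = pq.read_table("data.parquet", columns=["value"])

Comparing the sizes of the two output files on data like this is a quick way
to see the encoding-efficiency point made above.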
> > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
> > >>>
> > >>>> One benefit of the Feather format (i.e. the Arrow IPC file format)
> > >>>> is the ability to mmap the file, to easily handle reading sections
> > >>>> of a larger-than-memory file of data. Since, as Felipe mentioned,
> > >>>> the format is focused on the in-memory representation, you can
> > >>>> easily and simply mmap the file and use the raw bytes directly. For
> > >>>> a large file that you only want to read sections of, this can be
> > >>>> beneficial for IO and memory usage.
> > >>>>
> > >>>> Unfortunately, you are correct that it doesn't allow for easy column
> > >>>> projection (you're going to read all the columns for a record batch
> > >>>> in the file, no matter what). So it's going to be a trade-off based
> > >>>> on your needs as to whether it makes sense, or whether you should
> > >>>> use a file format like Parquet instead.
> > >>>>
> > >>>> -Matt
> > >>>>
> > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > >>>> <felipe...@gmail.com> wrote:
> > >>>>
> > >>>>> It's not the best fit, since the format is really focused on the
> > >>>>> in-memory representation and direct computation, but you can do it:
> > >>>>>
> > >>>>> https://arrow.apache.org/docs/python/feather.html
> > >>>>>
> > >>>>> —
> > >>>>> Felipe
> > >>>>>
> > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> Is it a good idea to use Apache Arrow as a file format? It looks
> > >>>>>> like projecting columns isn't available by default.
> > >>>>>>
> > >>>>>> One of the benefits of the Parquet file format is column
> > >>>>>> projection, where the IO is limited to just the columns projected.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Nara
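The mmap pattern Matt describes looks roughly like this in pyarrow (the path
and table are invented for the sketch; writing uncompressed keeps the mapped
bytes directly usable, whereas a compressed IPC file would still have to
decompress its buffers on read):

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.ipc as ipc

    table = pa.table({"x": list(range(100_000)),
                      "y": [float(i) for i in range(100_000)]})
    feather.write_feather(table, "big.arrow", compression="uncompressed")

    # Memory-map the file and read a single record batch; only the pages
    # backing that batch are faulted in, though (per the thread) all columns
    # of that batch are still referenced.
    with pa.memory_map("big.arrow", "r") as source:
        reader = ipc.open_file(source)
        first_batch = reader.get_batch(0)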