If you are looking for a more formal discussion and empirical analysis of the differences, I suggest reading "A Deep Dive into Common Open Formats for Analytical DBMSs" [1], a VLDB 2023 paper (best-paper runner-up!) that compares and contrasts the Arrow, Parquet, ORC, and Feather file formats.
[1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> To further what others have already mentioned, the IPC file format is
> primarily optimised for IPC use-cases, that is, exchanging the entire
> contents between processes. It is relatively inexpensive to encode and
> decode, and supports all Arrow data types, making it ideal for things
> like spill-to-disk processing, distributed shuffles, etc.
>
> Parquet by comparison is a storage format, optimised for space
> efficiency and selective querying, with [1] containing an overview of
> the various techniques the format affords. It is comparatively expensive
> to encode and decode, and instead relies on index structures and
> statistics to accelerate access.
>
> Both are therefore perfectly viable options depending on your particular
> use-case.
>
> [1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>
> On 18/10/2023 13:59, Dewey Dunnington wrote:
> > Plenty of opinions here already, but I happen to think that IPC
> > streams and/or the Arrow File/Feather format are wildly underutilized.
> > For the use-case where you're mostly just going to read an entire file
> > into R or Python it's a bit faster (and far superior to CSV, pickling,
> > or .rds files in R).
> >
> > > you're going to read all the columns for a record batch in the file,
> > > no matter what
> >
> > The metadata for every column in every record batch has to be read,
> > but there's nothing inherent about the format that prevents
> > selectively loading into memory only the required buffers. (I don't
> > know off the top of my head if any reader implementation actually
> > does this.)
> >
> > On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
> > > The Arrow IPC file format is great: it focuses on the in-memory
> > > representation and direct computation. It supports compression and
> > > dictionary encoding, and the file can be deserialized to the
> > > in-memory Arrow format with zero copies.
> > >
> > > Parquet provides some strong functionality, like statistics, which
> > > can help prune unnecessary data during scanning and avoid CPU and
> > > IO cost. It also has highly efficient encodings, which can make a
> > > Parquet file smaller than the Arrow IPC file for the same data.
> > > However, some Arrow data types currently cannot be converted to a
> > > corresponding Parquet type in the arrow-cpp implementation; see the
> > > Arrow documentation for details.
> > >
> > > Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:
> > > > Also, there is https://github.com/lancedb/lance, which sits
> > > > between the two formats. Depending on the use case it can be a
> > > > great choice.
> > > >
> > > > Best regards
> > > > Adam Lippai
> > > >
> > > > On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
> > > > > One benefit of the Feather format (i.e. the Arrow IPC file
> > > > > format) is the ability to mmap the file to easily handle reading
> > > > > sections of a larger-than-memory file of data. Since, as Felipe
> > > > > mentioned, the format is focused on the in-memory representation,
> > > > > you can simply mmap the file and use the raw bytes directly. For
> > > > > a large file that you only want to read sections of, this can be
> > > > > beneficial for IO and memory usage.
> > > > >
> > > > > Unfortunately, you are correct that it doesn't allow for easy
> > > > > column projection (you're going to read all the columns for a
> > > > > record batch in the file, no matter what). So it's going to be a
> > > > > trade-off based on your needs as to whether it makes sense, or if
> > > > > you should use a file format like Parquet instead.
> > > > >
> > > > > -Matt
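The mmap approach Matt describes above is easy to try with pyarrow. Here is a minimal sketch (the file name, table contents, and sizes are made up for illustration): it writes a table as an Arrow IPC file, then memory-maps it and pulls out a single record batch without eagerly copying the whole file into heap memory.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Write a small table as an Arrow IPC file (the on-disk Feather v2 layout).
    # No compression is applied, so buffers can be referenced directly from
    # the mapping later.
    table = pa.table({
        "x": list(range(100_000)),
        "y": [float(i) for i in range(100_000)],
    })
    with pa.OSFile("data.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map the file and open it with the IPC reader. Buffers are
    # referenced from the mapped pages, so only the batches you actually
    # touch need to be paged in.
    with pa.memory_map("data.arrow", "r") as source:
        reader = ipc.open_file(source)
        print(reader.num_record_batches)
        batch = reader.get_batch(0)  # reads just this batch's pages
        print(batch.num_rows)

Whether untouched batches actually stay out of memory is ultimately up to the OS page cache, but the point stands: the IPC layout lets you address pieces of a larger-than-memory file without a deserialization step.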
> > > > > On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > > > > <felipe...@gmail.com> wrote:
> > > > > > It's not the best fit, since the format is really focused on the
> > > > > > in-memory representation and direct computation, but you can do it:
> > > > > >
> > > > > > https://arrow.apache.org/docs/python/feather.html
> > > > > >
> > > > > > --
> > > > > > Felipe
> > > > > >
> > > > > > On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Is it a good idea to use Apache Arrow as a file format? It looks
> > > > > > > like projecting columns isn't available by default.
> > > > > > >
> > > > > > > One of the benefits of the Parquet file format is column
> > > > > > > projection, where the IO is limited to just the columns projected.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Nara
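For the column-projection question that started the thread, a corresponding pyarrow sketch (again with illustrative file names and data): Parquet lets the reader restrict I/O to the projected column chunks, while the Feather/IPC reader also accepts a columns argument but, as discussed above, the format itself is not organised around per-column pruning in the same way.

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.feather as feather

    table = pa.table({
        "id": list(range(10_000)),
        "price": [i * 0.1 for i in range(10_000)],
        "comment": ["row %d" % i for i in range(10_000)],
    })

    # Parquet: only the "id" and "price" column chunks need to be read
    # from disk; statistics can additionally prune row groups.
    pq.write_table(table, "data.parquet")
    subset = pq.read_table("data.parquet", columns=["id", "price"])

    # Feather / Arrow IPC: the reader returns only the requested columns,
    # but the record-batch layout is not built around selective column I/O
    # the way Parquet's column chunks are.
    feather.write_feather(table, "data.feather")
    subset_ipc = feather.read_table("data.feather", columns=["id", "price"])

    print(subset.column_names, subset_ipc.column_names)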