On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> If you are looking for a more formal discussion and empirical analysis of
> the differences, I suggest reading "A Deep Dive into Common Open Formats
> for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!)
> that compares and contrasts the Arrow, Parquet, ORC and Feather file
> formats.
>
> [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
This is a very useful article, but it seems to take a DBMS angle. I'm
wondering if anyone has seen similar research with more of an ML/DL angle.
Of course, what I'm really asking for is to see how Lance would compare ;-)

Thanks,
Roman.

> On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > To further what others have already mentioned, the IPC file format is
> > primarily optimised for IPC use-cases, that is, exchanging the entire
> > contents between processes. It is relatively inexpensive to encode and
> > decode, and it supports all Arrow data types, making it ideal for things
> > like spill-to-disk processing, distributed shuffles, etc.
> >
> > Parquet, by comparison, is a storage format optimised for space
> > efficiency and selective querying, with [1] containing an overview of
> > the various techniques the format affords. It is comparatively expensive
> > to encode and decode, and instead relies on index structures and
> > statistics to accelerate access.
> >
> > Both are therefore perfectly viable options depending on your particular
> > use-case.
> >
> > [1]:
> > https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >
> > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > Plenty of opinions here already, but I happen to think that IPC
> > > streams and/or Arrow File/Feather are wildly underutilized. For the
> > > use-case where you're mostly just going to read an entire file into R
> > > or Python, it's a bit faster (and far superior to a CSV, to pickling,
> > > or to .rds files in R).
> > >
> > >> you're going to read all the columns for a record batch in the file,
> > >> no matter what
> > >
> > > The metadata for every column in every record batch has to be read,
> > > but there's nothing inherent about the format that prevents
> > > selectively loading into memory only the required buffers. (I don't
> > > know off the top of my head if any reader implementation actually
> > > does this.)
> > >
> > > On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
> > >> The Arrow IPC file format is great: it focuses on the in-memory
> > >> representation and direct computation. It supports compression and
> > >> dictionary encoding, and a file can be zero-copy deserialized to the
> > >> in-memory Arrow format.
> > >>
> > >> Parquet provides some strong functionality, like statistics, which
> > >> can help prune unnecessary data during scanning and avoid CPU and IO
> > >> cost. It also has highly efficient encodings, which can make a
> > >> Parquet file smaller than the Arrow IPC file for the same data.
> > >> However, some Arrow data types currently cannot be converted to a
> > >> corresponding Parquet type in the arrow-cpp implementation; see the
> > >> Arrow documentation for details.
> > >>
> > >> Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:
> > >>
> > >>> There is also https://github.com/lancedb/lance, which sits between
> > >>> the two formats. Depending on the use case it can be a great choice.
> > >>>
> > >>> Best regards
> > >>> Adam Lippai
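To make the trade-off above concrete, here is a minimal pyarrow sketch (the
file names, table contents, and the zstd choice are mine for illustration,
not from the thread): the same table written as Feather/IPC and as Parquet,
with the Parquet read projecting a single column.

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # A toy table; real workloads would be wider and larger.
    table = pa.table({
        "id": pa.array(range(1_000_000), type=pa.int64()),
        "value": pa.array([i * 0.5 for i in range(1_000_000)], type=pa.float64()),
    })

    # Arrow IPC / Feather V2: cheap to encode and decode, compression optional.
    feather.write_feather(table, "data.feather", compression="zstd")

    # Parquet: heavier encode/decode, but columnar encodings plus per-row-group
    # statistics let readers prune data during scans.
    pq.write_table(table, "data.parquet", compression="zstd")

    # Column projection: IO is limited to the requested column's chunks.
    values_only = pq.read_table("data.parquet", columns=["value"])

Comparing the sizes of the two output files on data like this is a quick way
to see the encoding-efficiency point made above.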
> > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
> > >>>
> > >>>> One benefit of the Feather format (i.e. the Arrow IPC file format)
> > >>>> is the ability to mmap the file, to easily handle reading sections
> > >>>> of a larger-than-memory file of data. Since, as Felipe mentioned,
> > >>>> the format is focused on the in-memory representation, you can
> > >>>> easily and simply mmap the file and use the raw bytes directly. For
> > >>>> a large file that you only want to read sections of, this can be
> > >>>> beneficial for IO and memory usage.
> > >>>>
> > >>>> Unfortunately, you are correct that it doesn't allow for easy column
> > >>>> projection (you're going to read all the columns for a record batch
> > >>>> in the file, no matter what). So it's going to be a trade-off based
> > >>>> on your needs as to whether it makes sense, or whether you should
> > >>>> use a file format like Parquet instead.
> > >>>>
> > >>>> -Matt
> > >>>>
> > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > >>>> <felipe...@gmail.com> wrote:
> > >>>>
> > >>>>> It's not the best fit, since the format is really focused on the
> > >>>>> in-memory representation and direct computation, but you can do it:
> > >>>>>
> > >>>>> https://arrow.apache.org/docs/python/feather.html
> > >>>>>
> > >>>>> —
> > >>>>> Felipe
> > >>>>>
> > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> Is it a good idea to use Apache Arrow as a file format? It looks
> > >>>>>> like projecting columns isn't available by default.
> > >>>>>>
> > >>>>>> One of the benefits of the Parquet file format is column
> > >>>>>> projection, where the IO is limited to just the columns projected.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Nara
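The mmap pattern Matt describes looks roughly like this in pyarrow (the path
and table are invented for the sketch; writing uncompressed keeps the mapped
bytes directly usable, whereas a compressed IPC file would still have to
decompress its buffers on read):

    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.ipc as ipc

    table = pa.table({"x": list(range(100_000)),
                      "y": [float(i) for i in range(100_000)]})
    feather.write_feather(table, "big.arrow", compression="uncompressed")

    # Memory-map the file and read a single record batch; only the pages
    # backing that batch are faulted in, though (per the thread) all columns
    # of that batch are still referenced.
    with pa.memory_map("big.arrow", "r") as source:
        reader = ipc.open_file(source)
        first_batch = reader.get_batch(0)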