> Of course, what I'm really asking for is to see how Lance would compare ;-)
> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).

Disclaimer: I work for LanceDB and am in no way an unbiased party. However, since you asked:

TL;DR: Lance performs 10-20x better than ORC or Parquet when retrieving a small, scattered selection of rows from a large dataset.

I went ahead and reproduced the experiment in the second paper using Lance. Specifically, a vector search for 10 elements against the first 100 million rows of the LAION-5B dataset. There were a few details missing from the paper (specifically, which index they used) and I ended up training a rather underwhelming index, but the performance of the index is unrelated to the file format and so irrelevant for this discussion anyway.

A vector search performs a CPU-intensive search against a relatively small index (in the paper this index was kept in memory, so I did the same in my experiment). This identifies the rows of interest. We then need to perform a take operation to select those rows from storage. This is the part where the file format matters. So all we are really measuring here is how long it takes to select N rows at random from a dataset. This is one of the use cases Lance was designed for, so it is no surprise that it performs better.

Note that Lance stores data uncompressed. However, that probably doesn't matter in this case. 100 million rows of LAION-5B require ~320GB. Only 20GB of this is metadata; the remaining 300GB is text & image embeddings, which are, by design, not very compressible. The entire Lance dataset required 330GB.

# Results:

The chart in the paper is quite small and uses a log scale, so I had to infer the performance numbers for Parquet & ORC as best I could. The numbers for Lance are accurate, as that is what I measured.
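Stripped of the vector-index machinery, the benchmark reduces to timing a take of a few scattered row ids, averaged over many queries. Here is a minimal sketch of that measurement loop, with the format-specific take stubbed out so it runs standalone; the `lance.dataset(uri).take(row_ids)` call named in the comment is an assumption about the Lance Python API, not something verified against a particular release:

```python
import random
import statistics
import time

# Parameters matching the experiment described above.
NUM_QUERIES = 64         # randomized queries averaged per configuration
ROWS_PER_QUERY = 10      # a vector search returns 10 neighbours
DATASET_ROWS = 100_000_000

def take_rows(row_ids):
    """Stand-in for the format-specific take.

    With Lance this would be roughly lance.dataset(uri).take(row_ids)
    (hypothetical call, for illustration); for Parquet or ORC it means
    locating and decoding every row group (or stripe) that holds one of
    the requested rows. Stubbed here so the sketch needs no storage.
    """
    return [{"row_id": i} for i in row_ids]

def bench_take():
    latencies_ms = []
    for _ in range(NUM_QUERIES):
        # The index search yields a handful of scattered row ids; model
        # that with a uniform random sample over the whole dataset.
        row_ids = random.sample(range(DATASET_ROWS), ROWS_PER_QUERY)
        start = time.perf_counter()
        rows = take_rows(row_ids)
        elapsed_ms = (time.perf_counter() - start) * 1000
        assert len(rows) == ROWS_PER_QUERY
        latencies_ms.append(elapsed_ms)
    return statistics.mean(latencies_ms)

print(f"mean take latency: {bench_take():.3f} ms")
```

Swapping the stub for a real reader turns this into the S3/SSD comparison below; for cold-cache numbers the kernel's disk cache would also need to be dropped between queries.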
These results are averaged over 64 randomized queries (each iteration ran a single query returning 10 results) with the kernel's disk cache cleared (same as the paper, I believe).

## S3:

Parquet: ~12,000ms
Orc: ~80,000ms
Lance: 1,696ms

## Local Storage (SSD):

Parquet: ~800ms
Orc: ~65ms (10ms spent searching index)
Lance: 61ms (59ms spent searching index)

At first glance it may seem like Lance performs about the same as Orc on an SSD. However, this is likely because my index was suboptimal (I did not spend any real time tuning it, since I could just look at the I/O times directly). The Lance format spent only 2ms on I/O, compared with ~55ms spent on I/O by Orc.

# Boring Details:

Index: IVF/PQ with 1,000 IVF partitions (probably should have been 10k partitions, but I'm not patient enough) and 96 PQ subvectors (1 byte per subvector)

Hardware: Tests were performed on an r6id.2xlarge (using the attached NVMe for the SSD tests) in the same region as the S3 storage

Minor detail: The embeddings provided with the LAION-5B dataset (CLIP ViT-L/14) are float16. Lance doesn't yet support float16, so I inflated them to float32 (that doubles the amount of data retrieved, so, if anything, it makes things harder on Lance)

On Thu, Oct 19, 2023 at 9:55 AM Aldrin <octalene....@pm.me.invalid> wrote:

> And the first paper's reference of Arrow (in the references section) lists
> 2022 as the date of last access.
>
> Sent from Proton Mail <https://proton.me/mail/home> for iOS
>
> On Thu, Oct 19, 2023 at 18:51, Aldrin <octalene....@pm.me.INVALID> wrote:
>
> For context, that second referenced paper has Wes McKinney as a co-author,
> so they were much better positioned to say "the right things."
> On Thu, Oct 19, 2023 at 18:38, Jin Shang <shangjin1...@gmail.com> wrote:
>
> Honestly I don't understand why this VLDB paper [1] chooses to include
> Feather in their evaluations. This paper studies OLAP DBMS file formats.
> Feather is clearly not optimized for the workload and performs badly in
> most of their benchmarks. This paper also has several inaccurate or
> outdated claims about Arrow, e.g. that Arrow has no run-length encoding,
> that Arrow's dictionary encoding only supports string types (Tables 3 and 5),
> that Feather is Arrow plus dictionary encoding and compression (Section 3.2),
> etc. Moreover, the two optimizations it proposes for Arrow (in Sections 8.1.1
> and 8.1.3) are actually just two new APIs for working with Arrow data that
> require no change to the Arrow format itself. I fear that this paper may
> actually discourage DB people from using Arrow as their *in-memory* format,
> even though it's the Arrow *file* format that performs badly for their workload.
>
> There is another paper, "An Empirical Evaluation of Columnar Storage
> Formats" [2], covering essentially the same topic. It, however, chooses not
> to evaluate Arrow (in Section 2) because "Arrow is not meant for long-term
> disk storage", citing Wes McKinney's blog post [3] from six years ago. (Wes
> is also a co-author of this paper.) Interestingly, the VLDB paper [1] also
> discusses the two blog posts [3][4] (in Section 9) and states that
> "several limitations (of Arrow) described in [4] persist to this day",
> which IMHO is not true.
>
> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).
> [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> [2] https://arxiv.org/abs/2304.05028
> [3] https://wesmckinney.com/blog/arrow-columnar-abadi/
> [4] https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
>
> On Thu, Oct 19, 2023 at 7:38 PM Roman Shaposhnik <ro...@shaposhnik.org> wrote:
>
> > On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > If you are looking for a more formal discussion and empirical analysis of
> > > the differences, I suggest reading "A Deep Dive into Common Open Formats
> > > for Analytical DBMSs" [1], a VLDB 2023 paper (runner-up for best paper!)
> > > that compares and contrasts the Arrow, Parquet, ORC and Feather file formats.
> > >
> > > [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
> >
> > This is a very useful article, but it seems to be taking a DBMS angle.
> > I'm wondering if anyone has seen similar research but with more of an
> > ML/DL angle taken.
> >
> > Of course, what I'm really asking for is to see how Lance would compare ;-)
> >
> > Thanks,
> > Roman.
> >
> > > On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > > > To further what others have already mentioned, the IPC file format is
> > > > primarily optimised for IPC use-cases, that is, exchanging the entire
> > > > contents between processes. It is relatively inexpensive to encode and
> > > > decode, and supports all Arrow datatypes, making it ideal for things
> > > > like spill-to-disk processing, distributed shuffles, etc.
> > > >
> > > > Parquet, by comparison, is a storage format optimised for space
> > > > efficiency and selective querying, with [1] containing an overview of
> > > > the various techniques the format affords. It is comparatively expensive
> > > > to encode and decode, and instead relies on index structures and
> > > > statistics to accelerate access.
> > > > Both are therefore perfectly viable options depending on your
> > > > particular use-case.
> > > >
> > > > [1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > >
> > > > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > > > Plenty of opinions here already, but I happen to think that IPC
> > > > > streams and/or Arrow File/Feather are wildly underutilized. For the
> > > > > use-case where you're mostly just going to read an entire file into R
> > > > > or Python it's a bit faster (and far superior to a CSV or pickling or
> > > > > .rds files in R).
> > > > >
> > > > >> you're going to read all the columns for a record batch in the
> > > > >> file, no matter what
> > > > >
> > > > > The metadata for every column in every record batch has to be
> > > > > read, but there's nothing inherent about the format that prevents
> > > > > selectively loading into memory only the required buffers. (I don't
> > > > > know off the top of my head if any reader implementation actually
> > > > > does this.)
> > > > >
> > > > > On Wed, Oct 18, 2023 at 12:02 AM wish maple <maplewish...@gmail.com> wrote:
> > > > >> The Arrow IPC file format is great; it focuses on in-memory
> > > > >> representation and direct computation. It supports compression and
> > > > >> dictionary encoding, and the file can be zero-copy deserialized to
> > > > >> the in-memory Arrow format.
> > > > >>
> > > > >> Parquet provides some strong functionality, like statistics, which
> > > > >> can help prune unnecessary data during scanning and avoid CPU and
> > > > >> I/O cost. And it has highly efficient encodings, which can make a
> > > > >> Parquet file smaller than the Arrow IPC file for the same data.
> > > > >> However, currently some Arrow data types cannot be converted to the
> > > > >> corresponding Parquet type in the current arrow-cpp implementation.
> > > > >> You can go to the Arrow documentation to take a look.
> > > > >>
> > > > >> Adam Lippai <a...@rigo.sk> wrote on Wed, Oct 18, 2023 at 10:50:
> > > > >>
> > > > >>> Also there is https://github.com/lancedb/lance between the two
> > > > >>> formats. Depending on the use case it can be a great choice.
> > > > >>>
> > > > >>> Best regards
> > > > >>> Adam Lippai
> > > > >>>
> > > > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <zotthewiz...@gmail.com> wrote:
> > > > >>>
> > > > >>>> One benefit of the Feather format (i.e. the Arrow IPC file format)
> > > > >>>> is the ability to mmap the file to easily handle reading sections
> > > > >>>> of a larger-than-memory file of data. Since, as Felipe mentioned,
> > > > >>>> the format is focused on in-memory representation, you can easily
> > > > >>>> and simply mmap the file and use the raw bytes directly. For a
> > > > >>>> large file that you only want to read sections of, this can be
> > > > >>>> beneficial for I/O and memory usage.
> > > > >>>>
> > > > >>>> Unfortunately, you are correct that it doesn't allow for easy
> > > > >>>> column projection (you're going to read all the columns for a
> > > > >>>> record batch in the file, no matter what). So it's going to be a
> > > > >>>> trade-off based on your needs as to whether it makes sense, or
> > > > >>>> whether you should use a file format like Parquet instead.
> > > > >>>>
> > > > >>>> -Matt
> > > > >>>>
> > > > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho
> > > > >>>> <felipe...@gmail.com> wrote:
> > > > >>>>
> > > > >>>>> It’s not the best, since the format is really focused on in-memory
> > > > >>>>> representation and direct computation, but you can do it:
> > > > >>>>>
> > > > >>>>> https://arrow.apache.org/docs/python/feather.html
> > > > >>>>>
> > > > >>>>> —
> > > > >>>>> Felipe
> > > > >>>>>
> > > > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara <narayanan.arunacha...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>>> Hi,
> > > > >>>>>>
> > > > >>>>>> Is it a good idea to use Apache Arrow as a file format? It looks
> > > > >>>>>> like projecting columns isn't available by default.
> > > > >>>>>>
> > > > >>>>>> One of the benefits of the Parquet file format is column
> > > > >>>>>> projection, where the IO is limited to just the columns projected.
> > > > >>>>>>
> > > > >>>>>> Regards,
> > > > >>>>>> Nara