[ https://issues.apache.org/jira/browse/ARROW-16944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561153#comment-17561153 ]
David Li commented on ARROW-16944:
----------------------------------
It would be a good idea to gather some "real world" datasets to use. NYC Taxi
is an obvious one; ARROW-9612 and the associated discussion suggest Wikipedia
and US election data as well.

> [C++] Create macro-benchmarks of file format readers
> ----------------------------------------------------
>
> Key: ARROW-16944
> URL: https://issues.apache.org/jira/browse/ARROW-16944
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: David Li
> Priority: Major
>
> Currently we have (some) microbenchmarks, but measuring performance of our
> various readers (CSV, JSON, IPC, Parquet, ORC) over "real world" files would
> also be interesting and hopefully more illustrative of the use cases we
> actually care about. Such benchmarks may be expensive, though.
> Ideally, we would do this in a variety of scenarios: in-memory (to focus on
> CPU optimization), on-disk (though such measurements would likely be
> extremely noisy?), and over the network (perhaps with something like Minio +
> Toxiproxy to try to have a consistent, reproducible setup) so that we can
> also judge the I/O characteristics of the readers.
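
To make the in-memory scenario from the description concrete, here is a minimal sketch of what such a harness could look like. It is only an illustration under stated assumptions: Python's standard-library csv and json modules stand in for the Arrow readers, the payloads are synthetic placeholders rather than real datasets such as NYC Taxi, and every function name below is hypothetical rather than part of any Arrow API.

```python
import csv
import io
import json
import statistics
import time


def make_csv_bytes(num_rows):
    """Build a synthetic CSV payload in memory (placeholder for a real dataset)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "fare", "note"])
    for i in range(num_rows):
        writer.writerow([i, i * 0.5, f"trip-{i}"])
    return buf.getvalue().encode("utf-8")


def make_jsonl_bytes(num_rows):
    """Build a synthetic newline-delimited JSON payload in memory."""
    lines = [
        json.dumps({"id": i, "fare": i * 0.5, "note": f"trip-{i}"})
        for i in range(num_rows)
    ]
    return "\n".join(lines).encode("utf-8")


def read_csv(payload):
    """Stand-in CSV reader: parse every row, return the number of data rows."""
    rows = list(csv.reader(io.StringIO(payload.decode("utf-8"))))
    return len(rows) - 1  # exclude the header row


def read_jsonl(payload):
    """Stand-in JSON reader: parse every line, return the number of records."""
    records = [json.loads(line) for line in payload.splitlines() if line]
    return len(records)


def benchmark(reader, payload, repeats=5):
    """Time `reader` over an in-memory payload so that disk and network I/O
    are excluded; report the median wall time and the derived throughput."""
    timings = []
    rows = 0
    for _ in range(repeats):
        start = time.perf_counter()
        rows = reader(payload)
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    mib = len(payload) / 2**20
    return {
        "rows": rows,
        "median_s": median,
        "mib_per_s": mib / median if median > 0 else float("inf"),
    }


if __name__ == "__main__":
    cases = [
        ("csv", read_csv, make_csv_bytes(50_000)),
        ("jsonl", read_jsonl, make_jsonl_bytes(50_000)),
    ]
    for name, reader, payload in cases:
        result = benchmark(reader, payload)
        print(f"{name}: {result['rows']} rows, "
              f"{result['median_s']:.4f} s median, "
              f"{result['mib_per_s']:.1f} MiB/s")
```

Reporting the median over several repeats is one simple way to dampen noise. The on-disk and network scenarios would need the same loop wrapped around a different data source, which this sketch does not attempt.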
--
This message was sent by Atlassian Jira
(v8.20.10#820010)