[ 
https://issues.apache.org/jira/browse/ARROW-16944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561153#comment-17561153
 ] 

David Li commented on ARROW-16944:
----------------------------------

It would be a good idea to gather some "real world" datasets to use. NYC Taxi 
is an obvious one; ARROW-9612 and the associated discussion suggest Wikipedia 
and US election data as well.

> [C++] Create macro-benchmarks of file format readers
> ----------------------------------------------------
>
>                 Key: ARROW-16944
>                 URL: https://issues.apache.org/jira/browse/ARROW-16944
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: David Li
>            Priority: Major
>
> Currently we have (some) microbenchmarks, but measuring performance of our 
> various readers (CSV, JSON, IPC, Parquet, ORC) over "real world" files would 
> also be interesting and hopefully more illustrative of the use cases we 
> actually care about. Such benchmarks may be expensive, though.
> Ideally, we would do this in a variety of scenarios: in-memory (to focus on 
> CPU optimization), on-disk (though such measurements would likely be 
> extremely noisy?), and over the network (perhaps with something like Minio + 
> Toxiproxy to try to have a consistent, reproducible setup) so that we can 
> also judge the I/O characteristics of the readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
