[
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772315#comment-16772315
]
Wes McKinney commented on ARROW-4313:
-------------------------------------
I'm involved in many projects, so I haven't been able to follow the discussion
to see where there is disagreement or conflict.
From my perspective, I want the following in the short term:
* A general purpose database schema, preferably for PostgreSQL, which can be
used to easily provision a new benchmark database
* A script for running the C++ benchmarks and inserting the results into the
database. This script should capture hardware information as well as any
additional information that is known about the environment (OS, third-party
library versions -- e.g. so we can see whether upgrading a dependency such as
gRPC causes a performance regression)
I think we should work as quickly as possible to get a working version of both
of these, to validate that we are on the right track. If we try to come up with
the "perfect database schema" and punt the benchmark collector script until
later, we could be waiting a long time.
Ideally the database schema can accommodate results from benchmark execution
frameworks other than Google Benchmark for C++. That way we could write an
adapter script to export data from ASV (for Python) into this database.
[~aregm] this does not seem to be out of line with the requirements you listed
unless I am misunderstanding. I would rather not be too involved with the
details right now unless the project stalls out for some reason and needs me to
help push it through to completion.
> Define general benchmark database schema
> ----------------------------------------
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Benchmarking
> Reporter: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
> Time Spent: 9h 10m
> Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
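The attribute list above can be sketched as a relational schema. The following is a minimal, illustrative layout (machine, benchmark, and run tables), shown with SQLite for portability rather than the PostgreSQL schema the issue calls for; all table and column names are assumptions:

```python
import sqlite3

# Illustrative schema mirroring the attributes listed in the issue.
# Not the project's actual data model.
DDL = """
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,        -- machine unique name (the "user id")
    cpu_model TEXT,
    cpu_frequency_hz INTEGER,         -- records overclocking if present
    l1_cache_bytes INTEGER,
    l2_cache_bytes INTEGER,
    l3_cache_bytes INTEGER,
    cpu_throttling_enabled BOOLEAN,   -- NULL if not easily determined
    ram_bytes INTEGER,
    gpu_model TEXT                    -- NULL if no GPU
);
CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,        -- benchmark unique name
    languages TEXT                    -- e.g. "C++,Python"
);
CREATE TABLE benchmark_run (
    run_id INTEGER PRIMARY KEY,
    benchmark_id INTEGER REFERENCES benchmark(benchmark_id),
    machine_id INTEGER REFERENCES machine(machine_id),
    run_timestamp TEXT NOT NULL,      -- timestamp of benchmark run
    git_commit TEXT NOT NULL,         -- commit hash of codebase
    value REAL NOT NULL,              -- benchmark time
    mean REAL,                        -- NULL if unavailable
    stddev REAL                       -- NULL if unavailable
);
"""

def create_schema(conn):
    """Create the benchmark tables in the given connection."""
    conn.executescript(DDL)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    print(tables)  # ['benchmark', 'benchmark_run', 'machine']
```

Keeping machine and benchmark identity in separate tables is what lets runs from heterogeneous hardware and multiple execution frameworks (Google Benchmark, ASV) land in the same `benchmark_run` table.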
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)