[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772315#comment-16772315
 ] 

Wes McKinney commented on ARROW-4313:
-------------------------------------

I'm involved in many projects, so I haven't been able to follow the discussion to 
see where there is disagreement or conflict. 

From my perspective, I want the following in the short term:

* A general purpose database schema, preferably for PostgreSQL, which can be 
used to easily provision a new benchmark database
* A script for running the C++ benchmarks and inserting the results into the 
database. This script should capture hardware information as well as any 
additional information that is known about the environment (OS, third-party 
library versions), so that we can see, for example, whether upgrading a 
dependency such as gRPC causes a performance problem
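
A minimal sketch of what these two pieces could look like together, offered as 
an illustration rather than the actual design. It assumes the benchmark report 
has the JSON layout that Google Benchmark emits (a top-level "benchmarks" list 
with "name", "real_time" and "time_unit" fields). It uses Python's built-in 
sqlite3 as a stand-in for PostgreSQL so that it is self-contained. The table 
names, column names, and the functions collect_environment and store_report 
are hypothetical.

```python
import json
import platform
import sqlite3
from datetime import datetime, timezone

# Hypothetical three-table layout: machine -> benchmark_run -> benchmark_result.
SCHEMA = """
CREATE TABLE IF NOT EXISTS machine (
    machine_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    os TEXT,
    cpu TEXT
);
CREATE TABLE IF NOT EXISTS benchmark_run (
    run_id INTEGER PRIMARY KEY,
    machine_id INTEGER NOT NULL REFERENCES machine(machine_id),
    run_timestamp TEXT NOT NULL,
    git_commit TEXT NOT NULL,
    dependency_versions TEXT
);
CREATE TABLE IF NOT EXISTS benchmark_result (
    run_id INTEGER NOT NULL REFERENCES benchmark_run(run_id),
    benchmark_name TEXT NOT NULL,
    real_time REAL NOT NULL,
    time_unit TEXT NOT NULL
);
"""


def collect_environment():
    """Gather the basic machine facts available from the standard library."""
    return {
        "name": platform.node() or "unknown",
        "os": f"{platform.system()} {platform.release()}",
        "cpu": platform.processor() or platform.machine(),
    }


def store_report(conn, report, git_commit, dependency_versions=None):
    """Insert one benchmark report and return the id of the new run row."""
    conn.executescript(SCHEMA)
    env = collect_environment()
    # Register the machine once; later runs reuse the existing row.
    conn.execute(
        "INSERT OR IGNORE INTO machine (name, os, cpu) VALUES (?, ?, ?)",
        (env["name"], env["os"], env["cpu"]),
    )
    machine_id = conn.execute(
        "SELECT machine_id FROM machine WHERE name = ?", (env["name"],)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO benchmark_run "
        "(machine_id, run_timestamp, git_commit, dependency_versions) "
        "VALUES (?, ?, ?, ?)",
        (
            machine_id,
            datetime.now(timezone.utc).isoformat(),
            git_commit,
            # Dependency versions are stored as JSON text in this sketch.
            json.dumps(dependency_versions or {}),
        ),
    )
    run_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO benchmark_result VALUES (?, ?, ?, ?)",
        [
            (run_id, b["name"], b["real_time"], b["time_unit"])
            for b in report["benchmarks"]
        ],
    )
    conn.commit()
    return run_id
```

A report of this shape can be produced by running a Google Benchmark binary 
with --benchmark_format=json and loading the output with json.load. Details 
such as cache sizes, clock frequency, or GPU identification are not available 
from the standard library and would need platform-specific collection.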

I think we should work as quickly as possible to have a working version 
of both of these to validate that we are on the right track. If we try to come 
up with the "perfect database schema" and punt the benchmark collector script 
until later we could be waiting a long time. 

Ideally, the database schema can accommodate results from multiple benchmark 
execution frameworks, not only Google Benchmark for C++. For example, we could 
write an adapter script to export data from ASV (for Python) into this database.
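
One way to picture the adapter idea is a common record type that every 
framework-specific adapter produces. The sketch below is only an assumption 
about how that might look: BenchmarkRecord, from_google_benchmark and 
from_name_to_seconds are hypothetical names. The second adapter takes a 
simplified mapping of benchmark name to seconds, which is not the actual ASV 
results file format; a real ASV adapter would first have to extract such a 
mapping from ASV's own output.

```python
from dataclasses import dataclass

# Google Benchmark reports times in one of these units.
_TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


@dataclass(frozen=True)
class BenchmarkRecord:
    """Framework-neutral result row (hypothetical layout)."""
    framework: str
    language: str
    name: str
    seconds: float


def from_google_benchmark(report):
    """Convert a Google Benchmark JSON report into common records."""
    return [
        BenchmarkRecord(
            framework="google-benchmark",
            language="C++",
            name=b["name"],
            # Normalize every timing to seconds so frameworks are comparable.
            seconds=b["real_time"] * _TO_SECONDS[b["time_unit"]],
        )
        for b in report["benchmarks"]
    ]


def from_name_to_seconds(framework, language, timings):
    """Convert a simple {benchmark name: seconds} mapping into common records.

    Entries whose timing is None (for example, a failed benchmark) are skipped.
    """
    return [
        BenchmarkRecord(framework, language, name, float(seconds))
        for name, seconds in timings.items()
        if seconds is not None
    ]
```

With both adapters returning the same record type, the code that writes to the 
database does not need to know which framework produced a result.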

[~aregm] this does not seem to be out of line with the requirements you listed 
unless I am misunderstanding. I would rather not be too involved with the 
details right now unless the project stalls out for some reason and needs me to 
help push it through to completion. 

> Define general benchmark database schema
> ----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>          Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
>
> See discussion on mailing list:
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
