[
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755570#comment-16755570
]
Tanya Schlusser commented on ARROW-4313:
----------------------------------------
I think part of this was to allow anybody to contribute benchmarks from their
own machine. And while dedicated benchmarking machines like the ones you will
set up will have all parameters set for optimal benchmarking, benchmarks run on
other machines may give different results. Collecting details about the machine
that might explain those differences (in case someone cares to explore the
dataset) is part of the goal of the data model.
One concern, of course, is that people may get wildly different results than a
published benchmark reports and conclude, "Oh boo, the company's representative
posted fake numbers that I can't replicate on my machine." With details about
the system recorded, such performance differences can at least potentially be
traced back to differences in setup.
Not all fields need to be filled out all the time. My priorities are:
# Identifying which fields are flat-out wrong
# Differentiating between necessary columns and extraneous ones that can be
left null
To me, it is not a big deal to have an extra column dangling around that almost
nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what
I'm interested in getting out of the discussion here.)
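As one way to picture the discussion, the attributes listed in the issue
description below could map onto a normalized schema along these lines. This is
only an illustrative sketch in SQLite, not the proposed data model: every table
and column name here is an assumption, and the attached erdplus diagram is the
authoritative design.

```python
import sqlite3

# Hypothetical schema sketch for the benchmark database; names are
# illustrative, not the final ARROW-4313 design.
SCHEMA = """
CREATE TABLE machine (
    machine_id     INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE,  -- machine unique name (the "user id")
    cpu_model      TEXT,
    cpu_freq_hz    INTEGER,               -- actual clock, in case of overclocking
    l1_cache_bytes INTEGER,
    l2_cache_bytes INTEGER,
    l3_cache_bytes INTEGER,
    cpu_throttling INTEGER,               -- boolean; NULL if not easily determined
    ram_bytes      INTEGER,
    gpu_model      TEXT                   -- NULL if no GPU
);

CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,    -- benchmark unique name
    languages    TEXT                     -- e.g. "C++,Python" for mixed benchmarks
);

CREATE TABLE benchmark_run (
    run_id       INTEGER PRIMARY KEY,
    benchmark_id INTEGER NOT NULL REFERENCES benchmark(benchmark_id),
    machine_id   INTEGER NOT NULL REFERENCES machine(machine_id),
    run_time     TEXT NOT NULL,           -- timestamp of benchmark run
    git_hash     TEXT NOT NULL,           -- commit hash of the codebase
    elapsed_s    REAL NOT NULL,           -- benchmark time
    mean_s       REAL,                    -- NULL if unavailable
    stddev_s     REAL                     -- NULL if unavailable
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Rarely-used fields (GPU model, cache sizes) are simply nullable columns here,
which matches the point above: an extra dangling column costs nothing as long
as it is labeled correctly.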
> Define general benchmark database schema
> ----------------------------------------
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Benchmarking
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> See the discussion on the mailing list:
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)