[
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755570#comment-16755570
]
Tanya Schlusser commented on ARROW-4313:
----------------------------------------
I think part of this was to allow anybody to contribute benchmarks from their
own machine. And while dedicated benchmarking machines like the ones you will
set up will have all parameters set for optimal benchmarking, benchmarks run on
other machines may give different results. Collecting details about the machine
that might explain those differences (in case someone cares to explore the
dataset) is part of the goal of the data model.
One concern, of course, is that people may get wildly different results than a
published benchmark reports and conclude, "Oh boo, the company's representative
posted fake numbers that I can't replicate on my machine." With details about
the system recorded, such performance differences can at least potentially be
traced back to differences in setup.
Not all fields need to be filled out all the time. My priorities are:
# Identifying which fields are flat-out wrong
# Differentiating between necessary columns and extraneous ones that can be
left null
To me, it is not a big deal to have an extra column dangling around that almost
nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what
I'm interested in getting out of the discussion here.)
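As one way to picture the discussion, the attributes listed in the issue
description below could map onto a normalized schema along these lines. This is
only an illustrative sketch in SQLite, not the proposed data model: every table
and column name here is an assumption, and the attached erdplus diagram is the
authoritative design.

```python
import sqlite3

# Hypothetical schema sketch for the benchmark database; names are
# illustrative, not the final ARROW-4313 design.
SCHEMA = """
CREATE TABLE machine (
    machine_id     INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE,  -- machine unique name (the "user id")
    cpu_model      TEXT,
    cpu_freq_hz    INTEGER,               -- actual clock, in case of overclocking
    l1_cache_bytes INTEGER,
    l2_cache_bytes INTEGER,
    l3_cache_bytes INTEGER,
    cpu_throttling INTEGER,               -- boolean; NULL if not easily determined
    ram_bytes      INTEGER,
    gpu_model      TEXT                   -- NULL if no GPU
);

CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,    -- benchmark unique name
    languages    TEXT                     -- e.g. "C++,Python" for mixed benchmarks
);

CREATE TABLE benchmark_run (
    run_id       INTEGER PRIMARY KEY,
    benchmark_id INTEGER NOT NULL REFERENCES benchmark(benchmark_id),
    machine_id   INTEGER NOT NULL REFERENCES machine(machine_id),
    run_time     TEXT NOT NULL,           -- timestamp of benchmark run
    git_hash     TEXT NOT NULL,           -- commit hash of the codebase
    elapsed_s    REAL NOT NULL,           -- benchmark time
    mean_s       REAL,                    -- NULL if unavailable
    stddev_s     REAL                     -- NULL if unavailable
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Rarely-used fields (GPU model, cache sizes) are simply nullable columns here,
which matches the point above: an extra dangling column costs nothing as long
as it is labeled correctly.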
> Define general benchmark database schema
> ----------------------------------------
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Benchmarking
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> See the discussion on the mailing list:
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)