[
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755504#comment-16755504
]
Tanya Schlusser commented on ARROW-4313:
----------------------------------------
Thank you very much for everyone's detailed feedback. I absolutely need
guidance with the Machine / CPU / GPU specs. I have updated the
[^benchmark-data-model.png] and the [^benchmark-data-model.erdplus], and added
all of the recommended columns.
*Summary of changes:*
* All the dimension tables have been renamed to drop the `_dim` suffix. (It was
there to distinguish dimension tables from fact tables.)
* `cpu`
** Added a `cpu_thread_count`.
** Split `cpu.speed_Hz` into two columns, `frequency_max_Hz` and
`frequency_min_Hz`, and added a column `overclock_frequency_Hz` to the `machine`
table to allow for overclocking, as Wes mentioned at the beginning.
* `os`
** Added both `os.architecture_name` and `os.architecture_bits`, the latter
constrained to be in \{32, 64} and derived from the architecture name (maybe it
will become just a computed column in the joined view...). I think it's a good idea.
* `project`
** Added a `project.project_name` (oversight before)
* `benchmark_language`
** Split `language` into `language_name` and `language_version` because
people may want to compare between versions (e.g. Python 2.7, 3.5+)
* `environment`
** Removed foreign key for `machine_id` — that should be in the benchmark
report separately. Many machines will have the same environment.
* `benchmark`
** Added foreign key for `benchmark_language_id`—a benchmark with the same
name may exist for different languages.
** Added foreign key for `project_id`—moved it from table `benchmark_result`
* `benchmark_result`
** Added foreign key for `machine_id` (was removed from `environment`)
** Deleted foreign key for `project_id`, placing it in `benchmark` (as stated
above)
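
For concreteness, the foreign-key moves and the `architecture_bits` constraint listed above could be sketched roughly as follows. This is only an illustration using Python's built-in `sqlite3`; the column types, the `*_id` primary keys, `machine_name`, `dependencies`, the uniqueness constraints, and the `create_schema` helper are assumptions and are not taken from the attached diagram.

```python
import sqlite3

# Hypothetical DDL. Table names and the columns named in the summary come from
# the comment; every type, key, and constraint marked "assumed" is a guess.
SCHEMA = """
CREATE TABLE os (
    os_id INTEGER PRIMARY KEY,                      -- assumed
    architecture_name TEXT NOT NULL,
    architecture_bits INTEGER NOT NULL
        CHECK (architecture_bits IN (32, 64))
);
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
CREATE TABLE benchmark_language (
    benchmark_language_id INTEGER PRIMARY KEY,
    language_name TEXT NOT NULL,
    language_version TEXT NOT NULL,
    UNIQUE (language_name, language_version)        -- assumed
);
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,
    machine_name TEXT NOT NULL,                     -- assumed
    overclock_frequency_Hz INTEGER
);
CREATE TABLE environment (
    environment_id INTEGER PRIMARY KEY,             -- no machine_id here
    dependencies TEXT                               -- assumed
);
CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,               -- assumed
    benchmark_name TEXT NOT NULL,                   -- assumed
    benchmark_language_id INTEGER NOT NULL
        REFERENCES benchmark_language (benchmark_language_id),
    project_id INTEGER NOT NULL
        REFERENCES project (project_id),
    UNIQUE (benchmark_name, benchmark_language_id)  -- assumed
);
CREATE TABLE benchmark_result (
    benchmark_result_id INTEGER PRIMARY KEY,        -- assumed
    benchmark_id INTEGER NOT NULL
        REFERENCES benchmark (benchmark_id),
    machine_id INTEGER NOT NULL
        REFERENCES machine (machine_id),
    environment_id INTEGER NOT NULL
        REFERENCES environment (environment_id)
);
"""


def create_schema(conn):
    """Create the sketch tables on an open sqlite3 connection."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

With this layout, two benchmarks can share a name as long as their `benchmark_language_id` differs, and a result row points at its machine directly rather than through `environment`.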
*Questions*
* `cpu` and `gpu` dimension
** Is it a mistake to make `cpu.cpu_model_name` unique? That is, do the LX
cache levels, core counts, or any other attributes ever differ for the same
CPU model string?
** The same for GPU.
** I have commented the columns to say that `cpu_thread_count` corresponds to
`sysctl -n hw.logicalcpu` and `cpu_core_count` corresponds to `sysctl -n
hw.physicalcpu`; corrections gratefully accepted.
** Would it be less confusing to name the columns after the `sysctl` keys that
supply their values, e.g. change `cpu.cpu_model_name` to
`cpu.cpu_brand_string` to correspond to the output of `sysctl -n
machdep.cpu.brand_string`?
** On that note, is CPU RAM the same thing as `sysctl -n
machdep.cpu.cache.size`?
* `environment`
** I'm worried I'm doing something inelegant with the dependency list. It will
hold everything: Conda / virtualenv; versions of NumPy; all permutations of
the various dependencies, i.e. what ASV calls the dependency matrix.
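
The `sysctl` correspondences mentioned in the questions above could be captured in a small collector along these lines. This is a sketch that assumes macOS, since the `sysctl` keys are the ones quoted above; the helper names `read_sysctl` and `collect_cpu_row` are hypothetical, and the reader is passed in as a parameter so the mapping can be exercised on other platforms.

```python
import subprocess

# Column name -> macOS sysctl key, as described in the column comments above.
SYSCTL_KEYS = {
    "cpu_model_name": "machdep.cpu.brand_string",
    "cpu_core_count": "hw.physicalcpu",
    "cpu_thread_count": "hw.logicalcpu",
}


def read_sysctl(key):
    """Return the text printed by `sysctl -n <key>` (assumes macOS)."""
    result = subprocess.run(
        ["sysctl", "-n", key],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def collect_cpu_row(read=read_sysctl):
    """Build a dict of `cpu` column values using the given key reader."""
    row = {}
    for column, key in SYSCTL_KEYS.items():
        value = read(key)
        # Count columns are integers; the model name stays a string.
        row[column] = int(value) if column.endswith("_count") else value
    return row
```

On a Mac, `collect_cpu_row()` with no arguments would shell out to `sysctl`; elsewhere a stand-in reader can be supplied.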
> Define general benchmark database schema
> ----------------------------------------
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Benchmarking
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E