[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755504#comment-16755504
 ] 

Tanya Schlusser commented on ARROW-4313:
----------------------------------------

Thank you very much for everyone's detailed feedback. I absolutely need 
guidance with the Machine / CPU / GPU specs. I have updated the 
[^benchmark-data-model.png] and the [^benchmark-data-model.erdplus], and added 
all of the recommended columns.

 

*Summary of changes:*
 * All the dimension tables have been renamed to drop the `_dim` suffix. (The 
suffix was there to distinguish dimension tables from fact tables.)

 * `cpu`
 ** Added a `cpu_thread_count`. 
 ** Split `cpu.speed_Hz` into two columns, `frequency_max_Hz` and 
`frequency_min_Hz`, and added a column `machine.overclock_frequency_Hz` to 
the `machine` table to allow for overclocking, as Wes mentioned at the 
beginning.

 * `os`
 ** Added both `os.architecture_name` and `os.architecture_bits`. The latter 
is constrained to \{32, 64} and derived from the architecture name (maybe it 
will become just a computed column in the joined view...). I think it's a good 
idea.

 * `project`
 ** Added a `project.project_name` (oversight before)

 * `benchmark_language`
 ** Split `language` into `language_name` and `language_version` because 
people may want to compare across versions (e.g. Python 2.7, 3.5+)

 * `environment`
 ** Removed foreign key for `machine_id` — that should be in the benchmark 
report separately. Many machines will have the same environment.

 * `benchmark`
 ** Added foreign key for `benchmark_language_id`—a benchmark with the same 
name may exist for different languages.
 ** Added foreign key for `project_id`—moved it from table `benchmark_result`

 * `benchmark_result`
 ** Added foreign key for `machine_id` (was removed from `environment`)
 ** Deleted foreign key for `project_id`, placing it in `benchmark` (as stated 
above)
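
For reference, the key relationships after these changes could be sketched 
roughly as below. This is only an illustration using SQLite through Python's 
`sqlite3` module; the attached diagram remains the actual model. The `*_id` 
primary keys, the `machine_name` and `benchmark_name` columns, the 
`environment_id` foreign key on `benchmark_result`, and the uniqueness 
constraint on `benchmark` are assumptions made for the example, and most 
columns are omitted.

```python
import sqlite3

# Hypothetical, heavily trimmed schema: only the keys discussed above.
SCHEMA = """
CREATE TABLE os (
    os_id INTEGER PRIMARY KEY,
    architecture_name TEXT NOT NULL,
    -- architecture_bits is forced to be 32 or 64
    architecture_bits INTEGER NOT NULL CHECK (architecture_bits IN (32, 64))
);
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,
    machine_name TEXT NOT NULL          -- assumed column name
);
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
CREATE TABLE benchmark_language (
    benchmark_language_id INTEGER PRIMARY KEY,
    language_name TEXT NOT NULL,
    language_version TEXT NOT NULL
);
CREATE TABLE environment (
    environment_id INTEGER PRIMARY KEY  -- no machine_id here any more
);
CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    benchmark_name TEXT NOT NULL,       -- assumed column name
    benchmark_language_id INTEGER NOT NULL
        REFERENCES benchmark_language (benchmark_language_id),
    project_id INTEGER NOT NULL REFERENCES project (project_id),
    -- assumption: the same name may repeat, but only across languages
    UNIQUE (benchmark_name, benchmark_language_id)
);
CREATE TABLE benchmark_result (
    benchmark_result_id INTEGER PRIMARY KEY,
    benchmark_id INTEGER NOT NULL REFERENCES benchmark (benchmark_id),
    machine_id INTEGER NOT NULL REFERENCES machine (machine_id),
    environment_id INTEGER NOT NULL
        REFERENCES environment (environment_id)  -- assumed foreign key
);
"""


def build_schema(conn):
    """Create the sketch tables on an open sqlite3 connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
```

Calling `build_schema(sqlite3.connect(":memory:"))` gives a throwaway database 
for trying out inserts against these constraints.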

*Questions*
 * `cpu` and `gpu` dimension
 ** Is it a mistake to make `cpu.cpu_model_name` unique? I mean, are the LX 
cache levels, core counts, or any other attribute ever different for the same 
CPU model string?
 ** The same for GPU.
 ** I have commented the columns to say that `cpu_thread_count` corresponds to 
`sysctl -n hw.logicalcpu` and `cpu_core_count` corresponds to `sysctl -n 
hw.physicalcpu`; corrections gratefully accepted.
 ** Would it be less confusing to name the columns after the corresponding 
`sysctl` keys, e.g. change `cpu.cpu_model_name` to `cpu.cpu_brand_string` to 
match the output of `sysctl -n machdep.cpu.brand_string`?
 ** On that note, is CPU RAM the same thing as `sysctl -n 
machdep.cpu.cache.size`?
 * `environment`
 ** I'm worried I'm doing something inelegant with the dependency list. It will 
hold everything – Conda / virtualenv; versions of Numpy; all permutations of 
the various dependencies in what in ASV is the dependency matrix.
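
To make the `sysctl` correspondence from the `cpu` questions concrete, here is 
a minimal sketch of turning `sysctl` output into a row for the `cpu` table. It 
assumes the macOS `key: value` output format that `sysctl` prints when it is 
called without `-n`; the function and constant names are hypothetical, and the 
mapping simply restates the column comments described above.

```python
# sysctl key -> cpu column, as described in the column comments above.
SYSCTL_TO_COLUMN = {
    "hw.logicalcpu": "cpu_thread_count",
    "hw.physicalcpu": "cpu_core_count",
    "machdep.cpu.brand_string": "cpu_model_name",
}

# Columns whose sysctl value should be stored as an integer.
INTEGER_COLUMNS = {"cpu_thread_count", "cpu_core_count"}


def parse_sysctl(text):
    """Map 'key: value' lines of sysctl output to cpu column values."""
    row = {}
    for line in text.splitlines():
        # Split on the first colon only; the value may itself contain colons.
        key, sep, value = line.partition(":")
        column = SYSCTL_TO_COLUMN.get(key.strip())
        if not sep or column is None:
            continue  # not a key: value line, or a key we do not track
        value = value.strip()
        row[column] = int(value) if column in INTEGER_COLUMNS else value
    return row
```

On macOS the input text could come from something like 
`subprocess.run(["sysctl", *SYSCTL_TO_COLUMN], capture_output=True, 
text=True).stdout`; other platforms would need a different source for the 
same three values.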

> Define general benchmark database schema
> ----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
