[jira] [Commented] (ARROW-4313) Define general benchmark database schema

Antoine Pitrou (JIRA) Wed, 20 Feb 2019 04:15:01 -0800


    [ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772959#comment-16772959 ]


Antoine Pitrou commented on ARROW-4313:
---------------------------------------

Indeed, agreed with Wes.

Just to answer one comment:

> there is no goal to bench other languages, as they rely on C++ library calls 
> and you will benchmark the wrapper conversion speed

It's a bit more involved than that. For example, the speed of creating an 
Arrow array (or an Arrow dataframe) from Python objects is important, and this 
requires specific optimizations inside Arrow. Technically we _could_ benchmark 
it using the C++ infrastructure, but it's massively easier to write the 
benchmarks in Python using ASV, so that's what we're doing now.

That said, yes, recording C++ benchmark results is a good first-priority goal. 
The thing to keep in mind is that we don't want the adopted DB schema to limit 
us in this regard.

(Also, some implementations are not based on the C++ library: they are 
independent reimplementations of the Arrow data model, e.g. Java, C# or Rust.)


> Define general benchmark database schema
> ----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>          Time Spent: 10h 50m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages:
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
>
> See discussion on mailing list:
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
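
For illustration only, the attributes listed in the issue description could be 
laid out roughly as follows. This is a minimal sketch, not the schema attached 
to the issue: every table and column name here is an assumption, and SQLite is 
used only because it ships with Python. Languages live in their own table so 
that a benchmark can be associated with more than one language, and the mean 
and standard deviation columns are nullable, as the description suggests.

```python
import sqlite3

# Hypothetical layout: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE machine (
    machine_id       INTEGER PRIMARY KEY,
    name             TEXT NOT NULL UNIQUE,  -- machine unique name
    cpu_model        TEXT,
    cpu_frequency_hz INTEGER,               -- actual clock, in case of overclocking
    l1_cache_bytes   INTEGER,
    l2_cache_bytes   INTEGER,
    l3_cache_bytes   INTEGER,
    cpu_throttling   INTEGER,               -- 0/1, NULL if it cannot be determined
    ram_bytes        INTEGER,
    gpu_model        TEXT                   -- NULL if there is no GPU
);

CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE       -- benchmark unique name
);

-- One row per (benchmark, language), so a benchmark may involve
-- several languages, e.g. both C++ and Python.
CREATE TABLE benchmark_language (
    benchmark_id INTEGER NOT NULL REFERENCES benchmark(benchmark_id),
    language     TEXT NOT NULL,
    PRIMARY KEY (benchmark_id, language)
);

CREATE TABLE benchmark_run (
    run_id         INTEGER PRIMARY KEY,
    benchmark_id   INTEGER NOT NULL REFERENCES benchmark(benchmark_id),
    machine_id     INTEGER NOT NULL REFERENCES machine(machine_id),
    run_timestamp  TEXT NOT NULL,           -- ISO 8601 timestamp of the run
    git_commit     TEXT NOT NULL,           -- commit hash of the codebase
    time_seconds   REAL NOT NULL,
    mean_seconds   REAL,                    -- NULL if not available
    stddev_seconds REAL                     -- NULL if not available
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketched tables on an open SQLite connection."""
    conn.executescript(SCHEMA)
```

Calling `create_schema(sqlite3.connect(":memory:"))` builds the tables in a 
throwaway database. Nothing in this layout is specific to C++, which is in 
line with the concern raised in the comment that the schema should not limit 
benchmarks written in other languages.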



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
