On 20/02/2019 at 18:55, Melik-Adamyan, Areg wrote:
> 2. The unit-test framework (google/benchmark) can effectively report the needed benchmark data in textual format, with a preamble containing information about the machine on which the benchmarks are run.

On this topic, gbenchmark can actually output JSON, e.g.:

    ./build/release/arrow-utf8-util-benchmark --benchmark_out=results.json --benchmark_out_format=json

Here is what the JSON output looks like:
https://gist.github.com/pitrou/e055b454f333adf3c16325613c716309

Using this data, it should be easy to write an ingestion script that massages it into the format the database expects.
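For instance, a minimal ingestion sketch could look like the following. The `benchmark_run` table and its columns are just placeholders here, and gbenchmark's exact JSON field names can vary between versions, so treat this as a rough outline rather than a finished script:

    # Rough sketch: parse gbenchmark JSON output and insert rows into a
    # hypothetical PostgreSQL table. Table and column names are placeholders.
    import json
    import psycopg2

    def ingest(json_path, dsn):
        with open(json_path) as f:
            report = json.load(f)

        context = report.get("context", {})        # machine/run preamble
        benchmarks = report.get("benchmarks", [])  # one entry per benchmark

        conn = psycopg2.connect(dsn)
        with conn, conn.cursor() as cur:
            for bench in benchmarks:
                cur.execute(
                    """
                    INSERT INTO benchmark_run
                        (run_timestamp, machine_name, benchmark_name,
                         real_time, cpu_time, time_unit, iterations)
                    VALUES (%s, %s, %s, %s, %s, %s, %s)
                    """,
                    (
                        context.get("date"),
                        context.get("host_name"),
                        bench.get("name"),
                        bench.get("real_time"),
                        bench.get("cpu_time"),
                        bench.get("time_unit"),
                        bench.get("iterations"),
                    ),
                )
        conn.close()

    if __name__ == "__main__":
        ingest("results.json", "dbname=benchmarks user=arrow")

The "context" block of the same JSON also carries machine details (CPU count, clock frequency, and in recent gbenchmark versions the cache sizes), which is the kind of information discussed for the machine-identification columns later in this thread.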
> - Disallow entering any single benchmark run into the central repo, as single runs do not mean much in the context of continuous and statistically relevant measurements. [...]
> - Mandate that contributors have a dedicated environment for measurements.

I have no strong opinion on this. Another possibility is to regard one set of machines (e.g. Intel- or Ursa Labs-provided benchmarking machines, such as the DGX machines currently at Wes' office) as the reference for tracking regressions, and other machines as just informational.

That said, I think you're right that it doesn't sound very useful to allow arbitrary benchmark result submissions. However, I think there could still be a separate test database instance, to allow easy testing of ingestion or reporting scripts.

Regards

Antoine.

> 3. So with environments set up and regular runs you have all the artifacts, though not in a very comprehensible format. The reason to set up a dashboard is to let the data be consumed and to track the performance of various parts from a historical perspective, much more nicely and with visualizations.
>
> And here are the scope restrictions I have in mind:
> - Disallow entering any single benchmark run into the central repo, as single runs do not mean much in the context of continuous and statistically relevant measurements. What information do you get if someone reports a single run? You do not know how cleanly it was done and, more importantly, whether it is possible to reproduce it elsewhere. That is why, whether it is better, worse or the same, you cannot compare it with the data already in the DB.
> - Mandate that contributors have a dedicated environment for measurements. Otherwise they can use TeamCity to run and parse data and publish it on their own site. Data that enters the Arrow performance DB becomes Arrow community-owned data, and it becomes the community's job to answer why certain things got better or worse.
> - Because the number of CPU/GPU/accelerator flavors is huge, we cannot satisfy all the needs upfront and create a DB that covers all the possible variants. I think we should have simple CPU and GPU configs now, even if they are not perfect. By simple I mean the basic brand string; that should be enough. Having all the detailed info in the DB does not make sense: in my experience you never use it, you use the CPUID/brand name to get the info you need.
> - Scope and requirements will change over time, and going big now will make things complicated later. So I think it would be beneficial to have something quick up and running, get a better understanding of our needs and gaps, and go from there.
>
> The needed infra is already up on AWS, so as soon as we resolve the DNS and key exchange issues we can launch.
>
> -Areg.
>
> -----Original Message-----
> From: Tanya Schlusser [mailto:ta...@tickel.net]
> Sent: Thursday, February 7, 2019 4:40 PM
> To: dev@arrow.apache.org
> Subject: Re: Benchmarking dashboard proposal
>
> Late, but there's a PR now with a first-draft DDL (https://github.com/apache/arrow/pull/3586). Happy to receive any feedback!
>
> I tried to think about how people would submit benchmarks, and added a Postgraphile container for http-via-GraphQL. If others have strong opinions on the data modeling please speak up, because I'm more a database user than a designer.
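For illustration, submitting a result through such a GraphQL endpoint might look roughly like the sketch below. The endpoint URL, mutation name and input fields are hypothetical (PostGraphile derives them from whatever schema the DDL ends up defining), so this only shows the general shape of an HTTP submission:

    # Hypothetical sketch of posting one benchmark result to a PostGraphile
    # GraphQL endpoint. Mutation and field names depend on the actual schema.
    import requests

    MUTATION = """
    mutation ($input: CreateBenchmarkRunInput!) {
      createBenchmarkRun(input: $input) {
        benchmarkRun { id }
      }
    }
    """

    def submit(result, endpoint="http://localhost:5000/graphql"):
        payload = {
            "query": MUTATION,
            "variables": {"input": {"benchmarkRun": result}},
        }
        resp = requests.post(endpoint, json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        submit({
            "benchmarkName": "BenchmarkUTF8Validation",
            "machineName": "my-workstation",
            "gitCommit": "abcdef0",
            "realTime": 123.4,
            "timeUnit": "ns",
        })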
> I can also help with benchmarking work in R/Python given guidance / a roadmap / examples from someone else.
>
> Best,
> Tanya
>
> On Mon, Feb 4, 2019 at 12:37 PM Tanya Schlusser <ta...@tickel.net> wrote:
>
>> I hope to make a PR with the DDL by tomorrow or Wednesday night — the DDL along with a README in a new directory `arrow/dev/benchmarking`, unless directed otherwise.
>>
>> A "C++ Benchmark Collector" script would be super. I expect some back-and-forth on this to identify naïve assumptions in the data model. Attempting to submit actual benchmarks is how to get a handle on that. I recognize I'm blocking downstream work; better to get an initial PR and some discussion going.
>>
>> Best,
>> Tanya
>>
>> On Mon, Feb 4, 2019 at 10:10 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi folks,
>>>
>>> I'm curious where we currently stand on this project. I see the discussion in https://issues.apache.org/jira/browse/ARROW-4313 -- would the next step be to have a pull request with .sql files containing the DDL required to create the schema in PostgreSQL?
>>>
>>> I could volunteer to write the "C++ Benchmark Collector" script that will run all the benchmarks on Linux and collect their data to be inserted into the database.
>>>
>>> Thanks
>>> Wes
>>>
>>> On Sun, Jan 27, 2019 at 12:20 AM Tanya Schlusser <ta...@tickel.net> wrote:
>>>>
>>>> I don't want to be the bottleneck and have posted an initial draft data model in the JIRA issue https://issues.apache.org/jira/browse/ARROW-4313
>>>>
>>>> It should not be a problem to get the content into a form that would be acceptable either for a static site like ASV (via CORS queries to a GraphQL/REST interface) or for a codespeed-style site (via a separate schema organized for Django).
>>>>
>>>> I don't think I'm experienced enough to actually write any benchmarks, though, so all I can contribute is backend work for this task.
>>>>
>>>> Best,
>>>> Tanya
>>>>
>>>> On Sat, Jan 26, 2019 at 7:37 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>>> hi folks,
>>>>>
>>>>> I'd like to propose some kind of timeline for getting a first iteration of a benchmark database developed and live, with scripts to enable one or more initial agents to start adding new data on a daily / per-commit basis. I have at least 3 physical machines where I could immediately set up cron jobs to start adding new data, and I could attempt to backfill data as far back as possible.
>>>>>
>>>>> Personally, I would like to see this done by the end of February if not sooner -- if we don't have the volunteers to push the work to completion by then, please let me know as I will rearrange my priorities to make sure that it happens. Does that sound reasonable?
>>>>>
>>>>> Please let me know if this plan sounds reasonable:
>>>>>
>>>>> * Set up a hosted PostgreSQL instance, configure backups
>>>>> * Propose and adopt a database schema for storing benchmark results
>>>>> * For C++, write a script (or Dockerfile) to execute all google-benchmarks and output results to JSON, then an adapter script (Python) to ingest them into the database (a rough collector sketch follows below)
>>>>> * For Python, a similar script that invokes ASV, then inserts the ASV results into the benchmark database
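As a rough idea of what the C++ side of this could look like, here is a minimal collector sketch. The build directory layout and the `*-benchmark` naming convention are assumptions based on the example earlier in the thread, and the ingestion step is left as a stub:

    # Sketch of a "C++ Benchmark Collector": run each google-benchmark
    # executable with JSON output and gather the reports for ingestion.
    # Paths and naming conventions are assumptions, not settled decisions.
    import glob
    import json
    import os
    import subprocess

    BUILD_DIR = "build/release"        # assumed location of benchmark binaries
    OUTPUT_DIR = "benchmark-results"   # where the JSON reports are collected

    def run_all_benchmarks():
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        reports = []
        for exe in sorted(glob.glob(os.path.join(BUILD_DIR, "*-benchmark"))):
            out_path = os.path.join(OUTPUT_DIR, os.path.basename(exe) + ".json")
            subprocess.run(
                [exe,
                 "--benchmark_out={}".format(out_path),
                 "--benchmark_out_format=json"],
                check=True,
            )
            with open(out_path) as f:
                reports.append(json.load(f))
        return reports

    if __name__ == "__main__":
        for report in run_all_benchmarks():
            # TODO: hand each report to the database ingestion script
            print(report.get("context", {}).get("executable"),
                  len(report.get("benchmarks", [])), "benchmarks")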
>>>>> This seems to be a prerequisite for having a front end to visualize the results, but the dashboard/front end can hopefully be implemented in such a way that it is not too tightly coupled to the details of the benchmark database.
>>>>>
>>>>> (Do we have any other benchmarks in the project that would need to be inserted initially?)
>>>>>
>>>>> Related work to trigger benchmarks on agents when new commits land in master can happen concurrently -- one task need not block the other.
>>>>>
>>>>> Thanks
>>>>> Wes
>>>>>
>>>>> On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> Sorry, copy-paste failure: https://issues.apache.org/jira/browse/ARROW-4313
>>>>>>
>>>>>> On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>
>>>>>>> I don't think there is one but I just created
>>>>>>> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
>>>>>>>
>>>>>>> On Mon, Jan 21, 2019 at 10:35 AM Tanya Schlusser <ta...@tickel.net> wrote:
>>>>>>>>
>>>>>>>> Areg,
>>>>>>>>
>>>>>>>> If you'd like help, I volunteer! No experience benchmarking, but tons of experience databasing -- I can mock up the backend (database + http) as a starting point for discussion, if this is the way people want to go.
>>>>>>>>
>>>>>>>> Is there a Jira ticket for this that I can jump into?
>>>>>>>>
>>>>>>>> On Sun, Jan 20, 2019 at 3:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> hi Areg,
>>>>>>>>>
>>>>>>>>> This sounds great -- we've discussed building a more full-featured benchmark automation system in the past, but nothing has been developed yet.
>>>>>>>>>
>>>>>>>>> Your proposal about the details sounds OK; the single most important thing to me is that we build and maintain a very general-purpose database schema for the historical benchmark database.
>>>>>>>>>
>>>>>>>>> The benchmark database should keep track of:
>>>>>>>>>
>>>>>>>>> * Timestamp of benchmark run
>>>>>>>>> * Git commit hash of codebase
>>>>>>>>> * Machine unique name (sort of the "user id")
>>>>>>>>> * CPU identification for machine, and clock frequency (in case of overclocking)
>>>>>>>>> * CPU cache sizes (L1/L2/L3)
>>>>>>>>> * Whether or not CPU throttling is enabled (if it can be easily determined)
>>>>>>>>> * RAM size
>>>>>>>>> * GPU identification (if any)
>>>>>>>>> * Benchmark unique name
>>>>>>>>> * Programming language(s) associated with benchmark (e.g. a benchmark may involve both C++ and Python)
>>>>>>>>> * Benchmark time, plus mean and standard deviation if available, else NULL
>>>>>>>>>
>>>>>>>>> (maybe some other things)
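To make the field list above concrete, here is one possible shape for such a table. The actual DDL is being worked out in the pull request mentioned earlier in the thread, so the names and types below are only an illustration of how the listed fields might map onto PostgreSQL columns:

    # Illustrative only: one way the fields listed above could map onto a
    # PostgreSQL table. Names and types are placeholders, not the agreed schema.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS benchmark_run (
        id                 BIGSERIAL PRIMARY KEY,
        run_timestamp      TIMESTAMPTZ NOT NULL,
        git_commit         TEXT NOT NULL,
        machine_name       TEXT NOT NULL,
        cpu_model          TEXT,
        cpu_frequency_mhz  INTEGER,
        l1_cache_kb        INTEGER,
        l2_cache_kb        INTEGER,
        l3_cache_kb        INTEGER,
        cpu_throttling     BOOLEAN,
        ram_gb             INTEGER,
        gpu_model          TEXT,
        benchmark_name     TEXT NOT NULL,
        benchmark_language TEXT,
        value              DOUBLE PRECISION,
        mean               DOUBLE PRECISION,   -- NULL if not reported
        stddev             DOUBLE PRECISION    -- NULL if not reported
    );
    """

    def create_schema(dsn):
        conn = psycopg2.connect(dsn)
        with conn, conn.cursor() as cur:
            cur.execute(DDL)
        conn.close()

    if __name__ == "__main__":
        create_schema("dbname=benchmarks user=arrow")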
>>>>>>>>> I would rather not be locked into the internal database schema of a particular benchmarking tool, so that people in the community can just run SQL queries against the database and use the data however they like. We'll just have to be careful that people don't DROP TABLE or DELETE (but we should have daily backups so we can recover from such cases).
>>>>>>>>>
>>>>>>>>> So while we may make use of TeamCity to schedule the runs on the cloud and physical hardware, we should also provide a path for other people in the community to add data to the benchmark database from their hardware on an ad hoc basis. For example, I have several machines in my home on all operating systems (Windows / macOS / Linux, and soon also ARM64) and I'd like to set up scheduled tasks / cron jobs to report in to the database at least on a daily basis.
>>>>>>>>>
>>>>>>>>> Ideally the benchmark database would just be a PostgreSQL server with a schema we write down and keep backed up, etc. Hosted PostgreSQL is inexpensive ($200+ per year depending on the size of the instance; this probably doesn't need to be a crazy big machine).
>>>>>>>>>
>>>>>>>>> I suspect there will be a manageable amount of development involved in gluing each of the benchmarking frameworks to the benchmark database. That glue can also handle querying the operating system for the system information listed above.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wes
>>>>>>>>>
>>>>>>>>> On Fri, Jan 18, 2019 at 12:14 AM Melik-Adamyan, Areg <areg.melik-adam...@intel.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I want to restart / attach to the discussions about creating an Arrow benchmarking dashboard. I want to propose a per-commit performance benchmark run to track changes.
>>>>>>>>>>
>>>>>>>>>> The proposal includes building infrastructure for per-commit tracking comprising the following parts:
>>>>>>>>>> - JetBrains' hosted TeamCity for OSS https://teamcity.jetbrains.com/ as the build system
>>>>>>>>>> - Agents running both in the cloud as VMs/containers (DigitalOcean, or others) and on bare metal (Packet.net/AWS) and on-premise (Nvidia boxes?)
>>>>>>>>>> - JFrog Artifactory storage and management for OSS projects https://jfrog.com/open-source/#artifactory2
>>>>>>>>>> - Codespeed as a frontend https://github.com/tobami/codespeed
>>>>>>>>>>
>>>>>>>>>> I am volunteering to build such a system (if needed, more Intel folks will be involved) so we can start tracking performance on various platforms and understand how changes affect it.
>>>>>>>>>>
>>>>>>>>>> Please let me know your thoughts!
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Areg.