[
https://issues.apache.org/jira/browse/BEAM-4430?focusedWorklogId=111472&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-111472
]
ASF GitHub Bot logged work on BEAM-4430:
----------------------------------------
Author: ASF GitHub Bot
Created on: 13/Jun/18 11:19
Start Date: 13/Jun/18 11:19
Worklog Time Spent: 10m
Work Description: szewi commented on a change in pull request #465:
[BEAM-4430] Improve Performance Testing Documentation
URL: https://github.com/apache/beam-site/pull/465#discussion_r195044440
##########
File path: src/documentation/io/testing.md
##########
@@ -220,31 +229,207 @@ Parameter descriptions:
<td>Runner to be used for running the test. Currently possible options
are: direct, dataflow.
</td>
</tr>
+ <tr>
+ <td>-DbeamExtraProperties
+ </td>
+ <td>Any other "extra properties" to be passed to Gradle, eg. "'[filesystem=hdfs]'".
+ </td>
+ </tr>
</tbody>
</table>
-
-
#### Without PerfKit Benchmarker {#without-perfkit-benchmarker}
-If you're using Kubernetes, make sure you can connect to your cluster locally using kubectl. Otherwise, skip to step 3 below.
+If you're using Kubernetes scripts to host data stores, make sure you can connect to your cluster locally using kubectl. If you have your own data stores already setup, you just need to execute step 3 from below list.
1. Set up the data store corresponding to the test you wish to run. You can
find Kubernetes scripts for all currently supported data stores in
[.test-infra/kubernetes](https://github.com/apache/beam/tree/master/.test-infra/kubernetes).
1. In some cases, there is a setup script (*.sh). In other cases, you can
just run ``kubectl create -f [scriptname]`` to create the data store.
1. Convention dictates there will be:
- 1. A core yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines.
- 1. A separate script, called for-local-dev, which sets up a LoadBalancer service.
+ 1. A yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines from within same subnetwork. Such scripts are typically useful when running the scripts on Minikube Kubernetes Engine.
+ 1. A separate script, with LoadBalancer service. Such service will expose an _external ip_ for the datastore. Such scripts are needed when external access is required (eg. on Jenkins).
1. Examples:
1. For JDBC, you can set up Postgres: `kubectl create -f
.test-infra/kubernetes/postgres/postgres.yml`
1. For Elasticsearch, you can run the setup script: `bash
.test-infra/kubernetes/elasticsearch/setup.sh`
1. Determine the IP address of the service:
1. NodePort service: `kubectl get pods -l 'component=elasticsearch' -o
jsonpath={.items[0].status.podIP}`
1. LoadBalancer service:` kubectl get svc elasticsearch-external -o
jsonpath='{.status.loadBalancer.ingress[0].ip}'`
-1. Run the test using the instructions in the class (e.g. see the instructions in JdbcIOIT.java)
+1. Run the test using `integrationTest` gradle task and the instructions in the test class (e.g. see the instructions in JdbcIOIT.java).
1. Tell Kubernetes to delete the resources specified in the Kubernetes
scripts:
1. JDBC: `kubectl delete -f .test-infra/kubernetes/postgres/postgres.yml`
1. Elasticsearch: `bash .test-infra/kubernetes/elasticsearch/teardown.sh`
+##### integrationTest Task {#integration-test-task}
+
+Since `performanceTest` task involved running PerfkitBenchmarker, we can't use it to run the tests manually. For such purposes a more "low-level" task called `integrationTest` was introduced.
+
+
+Example usage on Cloud Dataflow runner:
+
+```
+./gradlew integrationTest -p sdks/java/io/hadoop-input-format -DintegrationTestPipelineOptions='["--project=GOOGLE_CLOUD_PROJECT", "--tempRoot=GOOGLE_STORAGE_BUCKET", "--numberOfRecords=1000", "--postgresPort=5432", "--postgresServerName=SERVER_NAME", "--postgresUsername=postgres", "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres", "--postgresSsl=false", "--runner=TestDataflowRunner"]' -DintegrationTestRunner=dataflow --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
+```
+
+Example usage on HDFS filesystem and Direct runner:
+
+```
+export HADOOP_USER_NAME=root
+
+./gradlew integrationTest -p sdks/java/io/file-based-io-tests -DintegrationTestPipelineOptions='["--numberOfRecords=1000", "--filenamePrefix=hdfs://HDFS_NAMENODE:9000/XMLIOIT", "--hdfsConfiguration=[{\"fs.defaultFS\":\"hdfs://HDFS_NAMENODE:9000\",\"dfs.replication\":1,\"dfs.client.use.datanode.hostname\":\"true\" }]" ]' -DintegrationTestRunner=direct -Dfilesystem=hdfs --tests org.apache.beam.sdk.io.xml.XmlIOIT
+```
+
+Parameter descriptions:
+
+
+<table class="table">
+ <thead>
+ <tr>
+ <td>
+ <strong>Option</strong>
+ </td>
+ <td>
+ <strong>Function</strong>
+ </td>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>-p sdks/java/io/file-based-io-tests/
+ </td>
+ <td>Specifies the project submodule of the I/O to test.
+ </td>
+ </tr>
+ <tr>
+ <td>-DintegrationTestPipelineOptions
+ </td>
+ <td>Passes pipeline options directly to the test being run.
+ </td>
+ </tr>
+ <tr>
+ <td>-DintegrationTestRunner
+ </td>
+ <td>Runner to be used for running the test. Currently possible options are: direct, dataflow.
+ </td>
+ </tr>
+ <tr>
+ <td>-Dfilesystem
+ </td>
+ <td>(optional, where applicable) Filesystem to be used to run the test. Currently possible options are: gcs, hdfs, s3. If not provided, local filesystem will be used.
+ </td>
+ </tr>
+ <tr>
+ <td>--tests
+ </td>
+ <td>Specifies the test to be run (fully qualified reference to class/test method).
+ </td>
+ </tr>
+ </tbody>
+</table>
+
+#### Running Integration Tests on Pull Requests {#running-on-pull-requests}
+
+Thanks to [ghprb](https://github.com/janinko/ghprb) plugin it is possible to run Jenkins jobs when specific phrase is typed in a Github Pull Request's comment. Integration tests that have Jenkins job defined can be triggered this way. You can run integration tests using these phrases:
+
+<table class="table">
+ <thead>
+ <tr>
+ <td>
+ <strong>Test</strong>
+ </td>
+ <td>
+ <strong>Phrase</strong>
+ </td>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>JdbcIOIT
+ </td>
+ <td>Run Java JdbcIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>MongoDBIOIT
+ </td>
+ <td>Run Java MongoDBIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>HadoopInputFormatIOIT
+ </td>
+ <td>Run Java HadoopInputFormatIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>TextIO - local filesystem
+ </td>
+ <td>Run Java TextIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>TextIO - HDFS
+ </td>
+ <td>Run Java TextIO Performance Test HDFS
+ </td>
+ </tr>
+ <tr>
+ <td>Compressed TextIO - local filesystem
+ </td>
+ <td>Run Java CompressedTextIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>Compressed TextIO - HDFS
+ </td>
+ <td>Run Java CompressedTextIO Performance Test HDFS
+ </td>
+ </tr>
+ <tr>
+ <td>AvroIO - local filesystem
+ </td>
+ <td>Run Java AvroIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>AvroIO - HDFS
+ </td>
+ <td>Run Java AvroIO Performance Test HDFS
+ </td>
+ </tr>
+ <tr>
+ <td>TFRecordIO - local filesystem
+ </td>
+ <td>Run Java TFRecordIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>ParquetIO - local filesystem
+ </td>
+ <td>Run Java ParquetIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>XmlIO - local filesystem
+ </td>
+ <td>Run Java XmlIO Performance Test
+ </td>
+ </tr>
+ <tr>
+ <td>XmlIO - HDFS
+ </td>
+ <td>Run Java XmlIO Performance Test on HDFS
+ </td>
+ </tr>
+ </tbody>
+</table>
+
+Every job definition can be found in [.test-infra/jenkins](https://github.com/apache/beam/tree/master/.test-infra/jenkins).
+If you modified/added new Jenkins job definitions in your Pull Request, run the seed job before running the integration test (comment: "Run seed job").
+
+### Performance testing dashboard {#performance-testing-dashboard}
+
+We mesure the performance of IOITs by gathering test execution times from Jenkins jobs that run periodically. The consequent results are stored in a database (BigQuery), therefore we can display them in a form of plots.
Review comment:
nit: mesure -> measure
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 111472)
Time Spent: 40m (was: 0.5h)
> Improve Performance Testing Documentation
> -----------------------------------------
>
> Key: BEAM-4430
> URL: https://issues.apache.org/jira/browse/BEAM-4430
> Project: Beam
> Issue Type: Wish
> Components: testing
> Reporter: Łukasz Gajowy
> Assignee: Łukasz Gajowy
> Priority: Critical
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently, the only documentation regarding IO Performance Testing can be
> found here:
> [https://beam.apache.org/documentation/io/testing/#i-o-transform-integration-tests].
> This is certainly not enough given that the performance testing framework
> currently allows to run tests:
> - on local or hdfs filesystems
> - on direct or dataflow runners
> - manually using integrationTest task
> - automatically using performanceTest task
> - using pkb.py tool directly (PerfKitBenchmarker)
> - on demand from pending Pull Requests
> - detecting anomalies
> - gathering results in dashboards
> All the above bullets (and maybe others - to be investigated) need more
> explanation in the docs to make the Performance Testing Framework usable by
> the broader community.