[ 
https://issues.apache.org/jira/browse/BEAM-4430?focusedWorklogId=111470&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-111470
 ]

ASF GitHub Bot logged work on BEAM-4430:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Jun/18 11:19
            Start Date: 13/Jun/18 11:19
    Worklog Time Spent: 10m 
      Work Description: szewi commented on a change in pull request #465: 
[BEAM-4430] Improve Performance Testing Documentation
URL: https://github.com/apache/beam-site/pull/465#discussion_r195042613
 
 

 ##########
 File path: src/documentation/io/testing.md
 ##########
 @@ -220,31 +229,207 @@ Parameter descriptions:
       <td>Runner to be used for running the test. Currently possible options 
are: direct, dataflow.
       </td>
     </tr>
+    <tr>
+      <td>-DbeamExtraProperties
+      </td>
+      <td>Any other "extra properties" to be passed to Gradle, eg. 
"'[filesystem=hdfs]'". 
+      </td>
+    </tr>
   </tbody>
 </table>
 
-
-
 #### Without PerfKit Benchmarker {#without-perfkit-benchmarker}
 
-If you're using Kubernetes, make sure you can connect to your cluster locally 
using kubectl. Otherwise, skip to step 3 below.
+If you're using Kubernetes scripts to host data stores, make sure you can 
connect to your cluster locally using kubectl. If you have your own data stores 
already setup, you just need to execute step 3 from below list.
 
 1.  Set up the data store corresponding to the test you wish to run. You can 
find Kubernetes scripts for all currently supported data stores in 
[.test-infra/kubernetes](https://github.com/apache/beam/tree/master/.test-infra/kubernetes).
     1.  In some cases, there is a setup script (*.sh). In other cases, you can 
just run ``kubectl create -f [scriptname]`` to create the data store.
     1.  Convention dictates there will be:
-        1.  A core yml script for the data store itself, plus a `NodePort` 
service. The `NodePort` service opens a port to the data store for anyone who 
connects to the Kubernetes cluster's machines.
-        1.  A separate script, called for-local-dev, which sets up a 
LoadBalancer service.
+        1.  A yml script for the data store itself, plus a `NodePort` service. 
The `NodePort` service opens a port to the data store for anyone who connects 
to the Kubernetes cluster's machines from within same subnetwork. Such scripts 
are typically useful when running the scripts on Minikube Kubernetes Engine.
+        1.  A separate script, with LoadBalancer service. Such service will 
expose an _external ip_ for the datastore. Such scripts are needed when 
external access is required (eg. on Jenkins). 
     1.  Examples:
         1.  For JDBC, you can set up Postgres: `kubectl create -f 
.test-infra/kubernetes/postgres/postgres.yml`
         1.  For Elasticsearch, you can run the setup script: `bash 
.test-infra/kubernetes/elasticsearch/setup.sh`
 1.  Determine the IP address of the service:
     1.  NodePort service: `kubectl get pods -l 'component=elasticsearch' -o 
jsonpath={.items[0].status.podIP}`
     1.  LoadBalancer service:` kubectl get svc elasticsearch-external -o 
jsonpath='{.status.loadBalancer.ingress[0].ip}'`
-1.  Run the test using the instructions in the class (e.g. see the 
instructions in JdbcIOIT.java)
+1.  Run the test using `integrationTest` gradle task and the instructions in 
the test class (e.g. see the instructions in JdbcIOIT.java).
 1.  Tell Kubernetes to delete the resources specified in the Kubernetes 
scripts:
     1.  JDBC: `kubectl delete -f .test-infra/kubernetes/postgres/postgres.yml`
     1.  Elasticsearch: `bash .test-infra/kubernetes/elasticsearch/teardown.sh`
 
+##### integrationTest Task {#integration-test-task}
+
+Since `performanceTest` task involved running PerfkitBenchmarker, we can't use 
it to run the tests manually. For such purposes a more "low-level" task called 
`integrationTest` was introduced.  
+
+
+Example usage on Cloud Dataflow runner: 
+
+```
+./gradlew integrationTest -p sdks/java/io/hadoop-input-format 
-DintegrationTestPipelineOptions='["--project=GOOGLE_CLOUD_PROJECT", 
"--tempRoot=GOOGLE_STORAGE_BUCKET", "--numberOfRecords=1000", 
"--postgresPort=5432", "--postgresServerName=SERVER_NAME", 
"--postgresUsername=postgres", "--postgresPassword=PASSWORD", 
"--postgresDatabaseName=postgres", "--postgresSsl=false", 
"--runner=TestDataflowRunner"]' -DintegrationTestRunner=dataflow 
--tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
+```
+
+Example usage on HDFS filesystem and Direct runner: 
 
 Review comment:
   This will only work if the /etc/hosts file contains entries with the external 
IPs of the Hadoop namenode and datanodes; otherwise the user will get a 
`java.nio.channels.UnresolvedAddressException`. It's worth mentioning, although 
this info is already in the comment section of the yml files. I would suggest at 
least adding:
   
   `Example usage on HDFS filesystem and Direct runner (with /etc/hosts entries 
added):`
   
   to make people aware of what needs to be done before running this with 
DirectRunner.
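
   For illustration, the /etc/hosts entries could be prepared roughly as in the 
sketch below. The hostnames (`hadoop-namenode`, `hadoop-datanode-0`) and the 
IP addresses (taken from the 203.0.113.0/24 documentation range) are 
placeholders, not values from the Beam yml files; the real names are the ones 
given in the yml comments, and the real IPs are the external ones your cluster 
reports. The helper names `hosts_line` and `hosts_entries` are hypothetical.

```shell
#!/bin/sh
# Print one /etc/hosts line ("IP hostname") for a given IP/hostname pair.
hosts_line() {
  printf '%s %s\n' "$1" "$2"
}

# Placeholder values (assumptions): replace them with the external IPs of
# your Hadoop namenode and datanodes and the hostnames from the yml comments.
hosts_entries() {
  hosts_line 203.0.113.10 hadoop-namenode
  hosts_line 203.0.113.11 hadoop-datanode-0
}

# Only prints the lines so they can be reviewed first. Appending them needs
# root privileges, for example:
#   hosts_entries | sudo tee -a /etc/hosts
hosts_entries
```

   Printing the lines instead of writing them directly keeps the sketch safe to 
run as a normal user; the append step is left as a manual command.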

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 111470)
    Time Spent: 20m  (was: 10m)

> Improve Performance Testing Documentation
> -----------------------------------------
>
>                 Key: BEAM-4430
>                 URL: https://issues.apache.org/jira/browse/BEAM-4430
>             Project: Beam
>          Issue Type: Wish
>          Components: testing
>            Reporter: Łukasz Gajowy
>            Assignee: Łukasz Gajowy
>            Priority: Critical
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, the only documentation regarding IO Performance Testing can be 
> found here: 
> [https://beam.apache.org/documentation/io/testing/#i-o-transform-integration-tests].
>  This is certainly not enough given that the performance testing framework 
> currently allows to run tests:
>  - on local or hdfs filesystems
>  - on direct or dataflow runners
>  - manually using integrationTest task
>  - automatically using performanceTest task
>  - using pkb.py tool directly (PerfKitBenchmarker)
>  - on demand from pending Pull Requests 
>  - detecting anomalies
>  - gathering results in dashboards
> All the above bullets (and maybe others - to be investigated) need more 
> explanation in the docs to make the Performance Testing Framework usable by 
> the broader community.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
