Abacn commented on code in PR #30879: URL: https://github.com/apache/beam/pull/30879#discussion_r1563292058
##########
contributor-docs/code-change-guide.md:
##########
@@ -0,0 +1,518 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+This guide is for Beam users and developers changing and testing Beam code.
+Specifically, this guide provides information about:
+
+1. Testing code changes locally
+
+2. Building Beam artifacts with modified Beam code and using them for pipelines
+
+# Repository structure
+
+The Apache Beam GitHub repository (Beam repo) is essentially a "mono repo",
+containing everything in the Beam project, including the SDK, test
+infrastructure, dashboards, the [Beam website](https://beam.apache.org),
+the [Beam Playground](https://play.beam.apache.org), and so on.
+
+## Gradle quick start
+
+The Beam repo is a single Gradle project that contains all components, including Python,
+Go, the website, etc. It is useful to familiarize yourself with the concept of Gradle project structure:
+https://docs.gradle.org/current/userguide/multi_project_builds.html
+
+### Gradle key concepts
+
+* project: folder with build.gradle
+* task: defined in build.gradle
+* plugin: run in project build.gradle, pre-defined tasks and hierarchies
+
+For example, common tasks for a Java project or subproject include:
+
+- compileJava
+- compileTestJava
+- test
+- integrationTest
+
+To run a Gradle task, the command is `./gradlew -p <project path> <task>` or equivalently, `./gradlew :project:path:task_name`.
+For example:
+
+```
+./gradlew -p sdks/java/core compileJava
+
+./gradlew :sdks:java:harness:test
+```
+
+### Gradle project configuration: Beam specific
+
+* A **huge** plugin `buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin` manages everything.
+
+Then, in each Java project or subproject, the `build.gradle` file starts with:
+
+```groovy
+
+apply plugin: 'org.apache.beam.module'
+
+applyJavaNature( ... )
+```
+
+Relevant usage of `BeamModulePlugin` includes:
+* Managing Java dependencies
+* Configuring projects (Java, Python, Go, Proto, Docker, Grpc, Avro, and so on)
+  * Java -> applyJavaNature; Python -> applyPythonNature, and so on
+  * Defining common custom tasks for each type of project
+    * `test`: run Java unit tests
+    * `spotlessApply`: format Java code
+
+## Code paths
+
+The following are example code paths relevant for SDK development:
+
+* `sdks/java` Java SDK
+  * `sdks/java/core` Java core
+  * `sdks/java/harness` SDK harness (entrypoint of SDK container)
+
+* `runners` runner support, written in Java. For example:
+  * `runners/direct-java` Java direct runner
+  * `runners/flink-java` Java Flink runner
+  * `runners/google-cloud-dataflow-java` Dataflow runner (job submission, translation, etc)
+  * `runners/google-cloud-dataflow-java/worker` Worker on Dataflow legacy runner
+
+* `sdks/python` contains setup file and scripts to trigger test-suites
+  * `sdks/python/apache_beam` actual beam package
+    * `sdks/python/apache_beam/runners/worker` SDK worker harness entrypoint, state sampler
+    * `sdks/python/apache_beam/io` IO connectors
+    * `sdks/python/apache_beam/transforms` most "core" components
+    * `sdks/python/apache_beam/ml` Beam ML
+    * `sdks/python/apache_beam/runners` runner implementations and wrappers
+    * ...
+
+* `sdks/go` Go SDK
+
+* `.github/workflows` GitHub Actions workflows (for example, tests run under PR). Most
+  workflows run a single Gradle command.
+  Check which command is running for a test so that you can run the same command locally during development.
+
+## Environment setup
+
+To set up local development environments, first see the [Contributing guide](../CONTRIBUTING.md).
+If you plan to use Dataflow, see the [Google Cloud documentation](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-java) to set up `gcloud` credentials.
+
+To check whether your environment is set up, verify that your `PATH` has the following elements configured:
+
+* A Java environment (any supported Java version, preferably Java 8 as of 2024).
+  * This environment is needed for all development, because Beam is a Gradle project that uses the JVM.
+  * Recommended: Use [sdkman](https://sdkman.io/install) to manage Java versions.
+* A Python environment (any supported Python version).
+  * Needed for Python SDK development.
+  * Recommended: Use [pyenv](https://github.com/pyenv/pyenv) and
+    a [virtual environment](https://docs.python.org/3/library/venv.html) to manage Python versions.
+* A Go environment. Install the latest Go version.
+  * Needed for Go SDK development and SDK container changes (for all SDKs), because
+    the container entrypoint scripts are written in Go.
+* A Docker environment.
+  * Needed for SDK container changes, some cross-language functionality (if you run an
+    SDK container image), portable runners (using job server), etc.
+
+For example:
+- When you test a code change in `sdks/java/io/google-cloud-platform`, you need a Java environment.
+- When you test a code change in `sdks/java/harness`, you need a Java environment, a Go
+  environment, and a Docker environment. You need the Docker environment to compile and build the Java SDK harness container image.
+- When you test a code change in `sdks/python/apache_beam`, you need a Python environment.
+
+# Java guide
+This section provides guidance for setting up your Java environment.
+## IDE (IntelliJ) setup
+
+1. 
From IntelliJ, open `/beam` (**Important:** Open the repository root directory, instead of `sdks/java`).
+
+2. Wait for indexing. Indexing might take a few minutes.
+
+If the prerequisites are met, the environment setup is complete, because Gradle is a self-contained build tool.
+
+To verify whether the load is successful, find the file `examples/java/build.gradle`. Next to the wordCount task,
+a **Run** button is present. Click **Run**. The wordCount example compiles and runs.
+
+<img width="631" alt="image" src="https://github.com/apache/beam/assets/8010435/f5203e8e-0f9c-4eaa-895b-e16f68a808a2">
+
+**Note:** The IDE is not required for changing the code and testing.
+You can run tests by using a Gradle command line, as described in the Console (shell) setup section.
+
+## Console (shell) setup
+
+To verify your setup by using the Gradle command line, run the following commands:
+
+```shell
+$ cd beam
+$ ./gradlew :examples:java:wordCount
+```
+
+When the command completes successfully, the following text appears in the Gradle build log:
+
+```
+...
+BUILD SUCCESSFUL in 2m 32s
+96 actionable tasks: 9 executed, 87 up-to-date
+3:41:06 PM: Execution finished 'wordCount'.
+```
+
+In addition, the following text appears in the output files:
+
+```shell
+
+$ head /tmp/output.txt*
+==> /tmp/output.txt-00000-of-00003 <==
+should: 38
+bites: 1
+depraved: 1
+gauntlet: 1
+battle: 6
+sith: 2
+cools: 1
+natures: 1
+hedge: 1
+words: 9
+
+==> /tmp/output.txt-00001-of-00003 <==
+elements: 1
+Advise: 2
+fearful: 2
+towards: 4
+ready: 8
+pared: 1
+left: 8
+safe: 4
+canst: 7
+warrant: 2
+
+==> /tmp/output.txt-00002-of-00003 <==
+chanced: 1
+...
+```
+
+*What does this command do?*
+
+This command compiles the Beam SDK and the WordCount pipeline, a "Hello World" program for
+data processing, then runs the pipeline on the Direct Runner.
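
As a rough illustration of what the WordCount pipeline computes, the following plain-Java sketch counts word occurrences and prints them in the same `word: count` form as the output files above. It is not Beam code and is not taken from the Beam examples; the class name `WordCountSketch`, the method `countWords`, and the tokenization rule (split on non-letter characters) are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java, no Beam dependency.
public class WordCountSketch {

    // Splits each line on runs of non-letter characters (an assumed rule)
    // and counts how often each distinct, case-sensitive word occurs.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split("[^\\p{L}]+")) {
                // A leading delimiter produces an empty first token; skip it.
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "that is the question");
        // Print each entry in the "word: count" form used by the output files.
        countWords(lines).forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

The real pipeline expresses the same idea as Beam transforms and writes sharded output files, which is why the results above are spread across `output.txt-00000-of-00003` and its siblings.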
+
+## Run a unit test
+
+This section explains how to run unit tests locally after you make a code change in the Java SDK (for example, in `sdks/java/io/jdbc`).
+
+Tests are under the `src/test/java` folder of each project. Unit tests have the filename `.../**Test.java`. Integration tests have the filename `.../**IT.java`.
+
+* To run all unit tests under a project, use the following command:
+  ```
+  ./gradlew :sdks:java:harness:test
+  ```
+  Find the JUnit report in an HTML file in the file path `<invoked_project>/build/reports/tests/test/index.html`.
+
+* To run a specific test, use the following commands:
+  ```
+  ./gradlew :sdks:java:harness:test --tests org.apache.beam.fn.harness.CachesTest
+  ./gradlew :sdks:java:harness:test --tests *CachesTest
+  ./gradlew :sdks:java:harness:test --tests *CachesTest.testClearableCache
+  ```
+
+* To run tests using IntelliJ, click the ticks to run a whole test class or a specific test. You can set breakpoints to debug the test.
+  <img width="452" alt="image" src="https://github.com/apache/beam/assets/8010435/7ae2a65c-a104-48a2-8bad-ff8c52dd1943">
+
+* These steps don't apply to `sdks:java:core` tests. Instead, invoke the unit tests by using `:runners:direct-java:needsRunnerTest`. Java core doesn't depend on a runner. Therefore, unit tests that run a pipeline require the Direct Runner.
+
+To run integration tests, use the Direct Runner.
+
+## Run integration tests (*IT.java)
+
+Integration tests use [`TestPipeline`](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java). Set options by using `TestPipelineOptions`.
+
+Integration tests differ from standard pipelines in the following ways:
+* By default, they block on run (on TestDataflowRunner).
+* They have a default timeout of 15 minutes.
+* The pipeline options are set in the system property `beamTestPipelineOptions`.
+
+Note the final difference: you need to configure the test by setting `-DbeamTestPipelineOptions=[...]`. This property is where you set the runner to use.
+
+The following example demonstrates how to run an integration test by using the command line. This example includes the options required to run the pipeline on the Dataflow runner:
+```
+-DbeamTestPipelineOptions='["--runner=TestDataflowRunner","--project=mygcpproject","--region=us-central1","--stagingLocation=gs://mygcsbucket/path"]'
+```
+
+### Write an integration test
+
+* To set up a `TestPipeline` object in an integration test, use the following code:
+  ```java
+
+  @Rule public TestPipeline pipeline = TestPipeline.create();
+
+  @Test
+  public void testSomething() {
+    pipeline.apply(...);
+
+    pipeline.run().waitUntilFinish();
+  }
+  ```
+
+* The task that runs the test needs to specify the runner. The following examples demonstrate how to specify the runner:
+  * To run a Google Cloud I/O integration test on the Direct Runner, use `:sdks:java:io:google-cloud-platform:integrationTest`
+  * To run integration tests on the standard Dataflow runner, use `:runners:google-cloud-dataflow-java:googleCloudPlatformLegacyWorkerIntegrationTest`
+  * To run integration tests on Dataflow runner v2, use `:runners:google-cloud-dataflow-java:googleCloudPlatformRunnerV2IntegrationTest`
+
+To see how to run your workflow locally, refer to the Gradle command that the GHA workflow runs.

Review Comment:
GHA means GitHub Action; will add this abbreviation at the first place it occurs.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
