jrmccluskey commented on code in PR #27415:
URL: https://github.com/apache/beam/pull/27415#discussion_r1258383385
##########
sdks/go/README.md:
##########
@@ -17,26 +17,87 @@
under the License.
-->
-# Go SDK
+# Go SDK Overview
The Apache Beam Go SDK is the Beam Model implemented in the [Go Programming Language](https://go.dev/).
It is based on the following initial [design](https://s.apache.org/beam-go-sdk-design-rfc).
+Below describes requirements, how to run examples, execute tests, and contribute to the Go SDK.
-## How to run the examples
+_A note on Beam specific terminology used in this README._
-**Prerequisites**: to use Google Cloud sources and sinks (default for
-most examples), follow the setup
-[here](https://beam.apache.org/documentation/runners/dataflow/). You can
-verify that it works by running the corresponding Java example.
+_This README uses minimally necessary Beam related terminology to help you determine requirements, usage and
+contribution to the Go SDK. A [definitions](#definitions) section below provides short definitions for you to achieve
+these aims._
Review Comment:
I don't think this call-out is particularly helpful (see my note on the
definitions section).
##########
sdks/go/README.md:
##########
@@ -84,56 +145,237 @@ sentence: 1
purse: 6
```
-To run wordcount on dataflow runner do:
+#### Run Example on the Dataflow Runner
+
+To run [examples/wordcount](examples/wordcount) on the [Dataflow Runner](#dataflow-runner) run the following. See
+[pkg/beam/runners/dataflow/dataflow.go](pkg/beam/runners/dataflow/dataflow.go) and
+[examples/wordcount/wordcount.go](examples/wordcount/wordcount.go) for a descriptions of required and optional flags.
+
+1. Set your
+[Google Cloud project](https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy#projects):
+ ```
+ GCP_PROJECT=$(gcloud config get-value project)
+ ```
+
+2. Set your [Google Cloud Compute region](https://cloud.google.com/compute/docs/regions-zones):
+ ```
+ GCP_REGION=us-central1
+ ```
+
+3. Create and set your [Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/buckets):
+ (_Note this is WITHOUT the `gs://`_)
+ ```
+ GCS_BUCKET=<your-google-cloud-storage-bucket-name>
+ ```
+
+4. Run the word count example
+ (see
+ [Google Cloud Documentation](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-go#run_the_pipeline_on_the_service)
+ for more details):
+ ```
+ go run go/examples/wordcount/wordcount.go --runner=dataflow \
+ --sdk_container_image=apache/beam_go_sdk:latest \
+ --project=$GCP_PROJECT \
+ --region=$GCP_REGION \
+ --staging_location=gs://$GCS_BUCKET/staging \
+ --output=gs://$GCS_BUCKET/output
+ ```
+
+ You should see:
+ ```
+ 2023/07/09 10:57:13 Submitted job: <job-id>
+ 2023/07/09 10:57:13 Console: https://console.cloud.google.com/dataflow/jobs/us-central1/<job-id>?project=<project>
+ 2023/07/09 11:02:56 Job state: JOB_STATE_PENDING ...
+ 2023/07/09 11:03:26 Job still running ...
+ ```
+
+5. After seeing `Job <job-id> succeeded!` you can inspect the resulting output.
+
+ Run the following command to inspect the resulting output.
+
+ ```
+ gcloud storage cat "gs://$GCS_BUCKET/output*" | head
+ ```
+
+ You should see something similar to the following:
+ ```
+ feature: 1
+ block: 1
+ Cried: 1
+ scatter'd: 1
+ she: 44
+ sudden: 1
+ silly: 1
+ More: 6
+ out: 68
+ believe: 3
+ ```
+
+#### Troubleshooting tips
+
+If you get the following error:
+```
+googleapi: Error 400: User project specified in the request is invalid.
+```
+
+Try the following:
+1. Make sure you followed the [Google Cloud Setup](#google-cloud-setup) above.
+2. Configure the [gcloud](https://cloud.google.com/sdk/docs/install-sdk) with your project:
+ ```
+ gcloud config set project <your-project>
+ ```
+3. Consider re-running BOTH `gcloud auth login` and `gcloud auth application-default login`
+ commands.
+
+## Testing
+
+### Requirements (For runner validations)
+
+Below lists additional requirements to execute tests in this repository on your local machine.
+
+#### 1. Java and Python
+
+You **do not** need to know or care about Java or Python to use the Go SDK for your data processing goals.
+
+Java is **only** required to execute any [gradle](https://gradle.org/) commands configured in the
+[Beam](https://github.com/apache/beam) repository. It will be obvious whether you will execute gradle commands in
+sections below. You **do not** need to install [gradle](https://gradle.org/) and simply use the enclosed
+`$BEAM_ROOT/gradlew` executable available at the root of the [Beam](https://github.com/apache/beam) repository.
+
+Python is **only** required if you need to validate tests against the
+[Portable Python Runner](#portable-python-runner). If you are not testing against this
+runner, ignore the Python requirement.
+
+#### 2. Docker
+
+[Docker](https://www.docker.com/) is required for certain but not all runners [See definition](#runner).
+
+As of this writing, [colima](https://github.com/abiosoft/colima), a preferred docker alternative for some developers,
+did not work.
+
+#### 3. Flock
+
+[sdks/go/run_with_go_version.sh](run_with_go_version.sh) requires the use of
+[flock](https://github.com/discoteq/flock).
+
+### Execution
+
+#### 1. Navigate to the go.mod directory
+
+Open a terminal and navigate into the go.mod containing directory of the Beam repository. (See above for what
+`$BEAM_ROOT` means). Notice that you are entering the `$BEAM_ROOT/sdks` and not `$BEAM_ROOT/sdks/go`.
+
+```sh
+cd $BEAM_ROOT/sdks
+```
+
+#### 2. Run Go test
+
+Run go test as you would any Go project.
+For unit tests in the exported `pkg/beam` package:
```
-$ go run wordcount.go --runner=dataflow --project=<YOUR_GCP_PROJECT> --region=<YOUR_GCP_REGION> --staging_location=<YOUR_GCS_LOCATION>/staging --worker_harness_container_image=<YOUR_SDK_HARNESS_IMAGE_LOCATION> --output=<YOUR_GCS_LOCATION>/output
+go test ./go/pkg/beam...
```
-The output is a GCS file in this case:
+For integration, load, and regression tests:
+```
+go test ./go/test/...
+```
+
+### Runner validations
+
+The following documents various [Runner](#runner) validation tests related to test execution of the Go SDK
+**in this repository** (in contrast to your own Go SDK dependent repository and projects).
+You'll see some documentation that deviates from Go test execution convention to run gradle commands.
+
+As a Go developer, you might ask yourself, "Why gradle instead of 'go test' 😬?"
+Under the hood, various [Runner](#runner) architecture involves resources and non-Go languages that Go SDK runner
+validations require. The cost of deviating from the `go test` convention, comes with the benefit of allowing you
+as a Beam contributor to focus on the Go SDK and ignore the underlying architecture specifics. The gradle commands
+conveniently automate and encapsulate these details.
+
+#### 1. Navigate to the Beam root directory
+
+Open a terminal and navigate into the root directory of the Beam repository. (See [Beam Repository](#beam-repository)
+for what `$BEAM_ROOT` means).
+
+```sh
+cd $BEAM_ROOT
+```
+
+#### Execute tests
+
+To execute tests that validate against the [Dataflow Runner](#dataflow-runner):
+
+```
+./gradlew :sdks:go:test:dataflowValidatesRunner
+```
+
+To execute tests that validate against the [Flink Runner](#flink-runner):
+
+Note that the following command takes several minutes to run (~15min).
```
-$ gsutil cat <YOUR_GCS_LOCATION>/output* | head
-Blanket: 1
-blot: 1
-Kneeling: 3
-cautions: 1
-appears: 4
-Deserved: 1
-nettles: 1
-OSWALD: 53
-sport: 3
-Crown'd: 1
+./gradlew :sdks:go:test:flinkValidatesRunner
```
+To execute tests that validate against the [Portable Python Runner](#portable-python-runner):
+
+```
+./gradlew :sdks:go:test:ulrValidatesRunner
+```
+
+# Build
See [BUILD.md](./BUILD.md) for how to build Go code in general. See
-[container documentation](https://beam.apache.org/documentation/runtime/environments/#building-container-images) for how to build and push the Go SDK harness container image.
+[container documentation](https://beam.apache.org/documentation/runtime/environments/#building-container-images) for how
+to build and push the Go SDK harness container image.
+
+# Issues
+
+Please use the [`sdk-go`](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Asdk-go) component for any
+bugs or feature requests.
+
+# Contributing to the Go SDK
+
+See [contribution guide](https://beam.apache.org/contribute/contribution-guide/#code) to create branches, and submit
+pull requests as normal.
+
+# Definitions
+
+## Local Runner
+
+A [Runner](#runner) that runs pipeline code on your local machine, in contrast to other runners such as
+the [Dataflow Runner](#dataflow-runner)
-## Issues
+## Dataflow Runner
-Please use the [`sdk-go`](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Asdk-go) component for any bugs or feature requests.
+A [Runner](#runner) that runs pipeline code on [Dataflow](https://cloud.google.com/dataflow).
-## Contributing to the Go SDK
+## Flink Runner
-### New to developing Go?
-https://tour.golang.org : The Go Tour gives you the basics of the language, interactively no installation required.
+A [Runner](#runner) that runs pipeline code on the [Flink Runner]()
-https://github.com/campoy/go-tooling-workshop is a great start on learning good (optional) development tools for Go.
+## Portable Python Runner
-### Developing Go Beam SDK on Github
+A [Runner](#runner) that runs pipeline code using the
+[Python Portable Runner](../python/apache_beam/runners/portability/portable_runner.py)
-The Go SDK uses Go Modules for dependency management so it's as simple as cloning
-the repo, making necessary changes and running tests.
+## Runner
-Executing all unit tests for the SDK is possible from the `<beam root>\sdks\go` directory and running `go test ./...`.
+A Runner runs your pipeline code. You write your pipeline using the Go programming language and execute on a Runner.
+See [examples](#run-examples) for how this looks in practice using the `--runner` flag.
+[https://beam.apache.org/documentation/runners/capability-matrix](https://beam.apache.org/documentation/runners/capability-matrix)
+provides a list of available Beam Runners.
-To test your change as Jenkins would execute it from a PR, from the
-beam root directory, run:
- * `./gradlew :sdks:go:goTest` executes the unit tests.
- * `./gradlew :sdks:go:test:ulrValidatesRunner` validates the SDK against the Portable Python runner.
- * `./gradlew :sdks:go:test:flinkValidatesRunner` validates the SDK against the Flink runner.
+## Source
-Follow the [contribution guide](https://beam.apache.org/contribute/contribution-guide/#code) to create branches, and submit pull requests as normal.
+A resource from which you read data in a Beam pipeline. Examples include a file directory from which you read
+files, a database from which you acquire data via a SQL query, or an API from which you consume response
+payloads.
+## Sink
+A resource to which you write data in a Beam pipeline. Examples include a hard disk directory to which you create and
+write file data, a database to which you insert or update records, or an API to which you post remote call payloads.
Review Comment:
Is it really necessary to provide these definitions here when they are not
Go-specific? They would be better served by a definitions document at the
top level.
##########
sdks/go/README.md:
##########
@@ -84,56 +145,237 @@ sentence: 1
purse: 6
```
+### Runner validations
+
+The following documents various [Runner](#runner) validation tests related to test execution of the Go SDK
+**in this repository** (in contrast to your own Go SDK dependent repository and projects).
+You'll see some documentation that deviates from Go test execution convention to run gradle commands.
+
+As a Go developer, you might ask yourself, "Why gradle instead of 'go test' 😬?"
+Under the hood, various [Runner](#runner) architecture involves resources and non-Go languages that Go SDK runner
+validations require. The cost of deviating from the `go test` convention, comes with the benefit of allowing you
+as a Beam contributor to focus on the Go SDK and ignore the underlying architecture specifics. The gradle commands
+conveniently automate and encapsulate these details.
Review Comment:
This feels like a tangent; I'd recommend cutting it.
Also, I'm not going to endorse emoji usage in the README.
##########
sdks/go/README.md:
##########
@@ -84,56 +145,237 @@ sentence: 1
purse: 6
```
+#### 2. Docker
+
+[Docker](https://www.docker.com/) is required for certain but not all runners [See definition](#runner).
+
+As of this writing, [colima](https://github.com/abiosoft/colima), a preferred docker alternative for some developers,
+did not work.
Review Comment:
The mention of colima here is unnecessary. We make no reference to it or
promise to support it anywhere, so why bring it up?
##########
sdks/go/README.md:
##########
@@ -84,56 +145,237 @@ sentence: 1
purse: 6
```
+#### Troubleshooting tips
+
+If you get the following error:
+```
+googleapi: Error 400: User project specified in the request is invalid.
+```
+
+Try the following:
+1. Make sure you followed the [Google Cloud Setup](#google-cloud-setup) above.
+2. Configure the [gcloud](https://cloud.google.com/sdk/docs/install-sdk) with your project:
+ ```
+ gcloud config set project <your-project>
+ ```
+3. Consider re-running BOTH `gcloud auth login` and `gcloud auth application-default login`
+ commands.
+
Review Comment:
This is good information to drop into a markdown file in the wordcount example
directory; it feels a little extra to include in the README.
##########
sdks/go/README.md:
##########
@@ -84,56 +145,237 @@ sentence: 1
purse: 6
```
+### Runner validations
+
+The following documents various [Runner](#runner) validation tests related to test execution of the Go SDK
+**in this repository** (in contrast to your own Go SDK dependent repository and projects).
+You'll see some documentation that deviates from Go test execution convention to run gradle commands.
Review Comment:
```suggestion
You'll see some documentation that deviates from Go test execution
convention to run gradle commands; this allows for orchestration of more
complex testing scenarios with runners other than the Go Direct Runner.
```
##########
sdks/go/run_with_go_version.sh:
##########
@@ -83,11 +83,15 @@ if [ -z "$GOCMD" ] ; then
# The download command isn't concurrency safe so we get an exclusive lock, without wait.
# If we're first, we ensure the command is downloaded, releasing the lock afterwards.
# This operation is cached on system and won't be re-downloaded at least.
+ # The dependency on flock below is documented at sdks/go/README.md. If this dependency is removed or changed,
+ # please make the corresponding changes at sdks/go/README.md.
flock --exclusive --nonblock --conflict-exit-code 0 $LOCKFILE $GOBIN/$GOVERS download
# Execute the script with the remaining arguments.
# We get a shared lock for the ordinary go command execution.
echo $GOBIN/$GOVERS $@
+ # The dependency on flock below is documented at sdks/go/README.md. If this dependency is removed or changed,
+ # please make the corresponding changes at sdks/go/README.md.
Review Comment:
No need to repeat this comment block twice.
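
For illustration only, here is a minimal sketch of what stating the note once could look like. It is not the actual script: `LOCKFILE` is a temporary file here, and `ensure_downloaded` and `run_locked` are hypothetical helper names that stand in for the two flock calls, which I am assuming take an exclusive non-blocking lock and a shared lock respectively, as the surrounding comments describe.

```shell
#!/usr/bin/env bash
# Sketch only: LOCKFILE, ensure_downloaded and run_locked are hypothetical
# names for illustration, not taken from run_with_go_version.sh.
set -euo pipefail

LOCKFILE="$(mktemp)"

# NOTE: the dependency on flock is documented at sdks/go/README.md. If it is
# removed or changed, update sdks/go/README.md as well. This single note
# covers both flock calls below.

# Exclusive, non-blocking lock: if another process already holds the lock,
# flock exits with code 0 and the wrapped command is skipped.
ensure_downloaded() {
  flock --exclusive --nonblock --conflict-exit-code 0 "$LOCKFILE" "$@"
}

# Shared lock for ordinary command execution.
run_locked() {
  flock --shared "$LOCKFILE" "$@"
}

if command -v flock >/dev/null 2>&1; then
  ensure_downloaded echo "download step"
  run_locked echo "run step"
else
  echo "flock is not installed; skipping the demo" >&2
fi
```

Placing the note above the first call keeps the pointer to the README visible without repeating it at every flock invocation.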
##########
sdks/go/README.md:
##########
@@ -84,56 +145,237 @@ sentence: 1
purse: 6
```
-To run wordcount on dataflow runner do:
+#### Run Example on the Dataflow Runner
+
+To run [examples/wordcount](examples/wordcount) on the [Dataflow
Runner](#dataflow-runner) run the following. See
+[pkg/beam/runners/dataflow/dataflow.go](pkg/beam/runners/dataflow/dataflow.go)
and
+[examples/wordcount/wordcount.go](examples/wordcount/wordcount.go) for a
descriptions of required and optional flags.
+
+1. Set your
+[Google Cloud
project](https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy#projects):
+ ```
+ GCP_PROJECT=$(gcloud config get-value project)
+ ```
+
+2. Set your [Google Cloud Compute
region](https://cloud.google.com/compute/docs/regions-zones):
+ ```
+ GCP_REGION=us-central1
+ ```
+
+3. Create and set your [Google Cloud Storage
Bucket](https://cloud.google.com/storage/docs/buckets):
+ (_Note this is WITHOUT the `gs://`_)
+ ```
+ GCS_BUCKET=<your-google-cloud-storage-bucket-name>
+ ```
+
+4. Run the word count example
+ (see
+ [Google Cloud
Documentation](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-go#run_the_pipeline_on_the_service)
+ for more details):
+ ```
+ go run go/examples/wordcount/wordcount.go --runner=dataflow \
+ --sdk_container_image=apache/beam_go_sdk:latest \
+ --project=$GCP_PROJECT \
+ --region=$GCP_REGION \
+ --staging_location=gs://$GCS_BUCKET/staging \
+ --output=gs://$GCS_BUCKET/output
+ ```
+
+ You should see:
+ ```
+ 2023/07/09 10:57:13 Submitted job: <job-id>
+ 2023/07/09 10:57:13 Console: https://console.cloud.google.com/dataflow/jobs/us-central1/<job-id>?project=<project>
+ 2023/07/09 11:02:56 Job state: JOB_STATE_PENDING ...
+ 2023/07/09 11:03:26 Job still running ...
+ ```
+
+5. After seeing `Job <job-id> succeeded!` you can inspect the resulting output.
+
+ Run the following command to inspect the resulting output.
+
+ ```
+ gcloud storage cat "gs://$GCS_BUCKET/output*" | head
+ ```
+
+ You should see something similar to the following:
+ ```
+ feature: 1
+ block: 1
+ Cried: 1
+ scatter'd: 1
+ she: 44
+ sudden: 1
+ silly: 1
+ More: 6
+ out: 68
+ believe: 3
+ ```
+
+#### Troubleshooting tips
+
+If you get the following error:
+```
+googleapi: Error 400: User project specified in the request is invalid.
+```
+
+Try the following:
+1. Make sure you followed the [Google Cloud Setup](#google-cloud-setup) above.
+2. Configure the [gcloud](https://cloud.google.com/sdk/docs/install-sdk) with your project:
+ ```
+ gcloud config set project <your-project>
+ ```
+3. Consider re-running BOTH `gcloud auth login` and `gcloud auth application-default login`
+ commands.
+
+## Testing
+
+### Requirements (For runner validations)
+
+Below lists additional requirements to execute tests in this repository on your local machine.
+
+#### 1. Java and Python
+
+You **do not** need to know or care about Java or Python to use the Go SDK for your data processing goals.
+
+Java is **only** required to execute any [gradle](https://gradle.org/) commands configured in the
+[Beam](https://github.com/apache/beam) repository. It will be obvious whether you will execute gradle commands in
+sections below. You **do not** need to install [gradle](https://gradle.org/) and simply use the enclosed
+`$BEAM_ROOT/gradlew` executable available at the root of the [Beam](https://github.com/apache/beam) repository.
+
+Python is **only** required if you need to validate tests against the
+[Portable Python Runner](#portable-python-runner). If you are not testing against this
+runner, ignore the Python requirement.
+
+#### 2. Docker
+
+[Docker](https://www.docker.com/) is required for certain but not all runners [See definition](#runner).
+
+As of this writing, [colima](https://github.com/abiosoft/colima), a preferred docker alternative for some developers,
+did not work.
+
+#### 3. Flock
+
+[sdks/go/run_with_go_version.sh](run_with_go_version.sh) requires the use of
+[flock](https://github.com/discoteq/flock).
+
+### Execution
+
+#### 1. Navigate to the go.mod directory
+
+Open a terminal and navigate into the go.mod containing directory of the Beam repository. (See above for what
+`$BEAM_ROOT` means). Notice that you are entering the `$BEAM_ROOT/sdks` and not `$BEAM_ROOT/sdks/go`.
+
+```sh
+cd $BEAM_ROOT/sdks
+```
+
+#### 2. Run Go test
+
+Run go test as you would any Go project.
+For unit tests in the exported `pkg/beam` package:
```
-$ go run wordcount.go --runner=dataflow --project=<YOUR_GCP_PROJECT> --region=<YOUR_GCP_REGION> --staging_location=<YOUR_GCS_LOCATION>/staging --worker_harness_container_image=<YOUR_SDK_HARNESS_IMAGE_LOCATION> --output=<YOUR_GCS_LOCATION>/output
+go test ./go/pkg/beam...
```
-The output is a GCS file in this case:
+For integration, load, and regression tests:
+```
+go test ./go/test/...
+```
+
+### Runner validations
+
+The following documents various [Runner](#runner) validation tests related to test execution of the Go SDK
+**in this repository** (in contrast to your own Go SDK dependent repository and projects).
+You'll see some documentation that deviates from Go test execution convention to run gradle commands.
+
+As a Go developer, you might ask yourself, "Why gradle instead of 'go test' 😬?"
+Under the hood, various [Runner](#runner) architecture involves resources and non-Go languages that Go SDK runner
+validations require. The cost of deviating from the `go test` convention, comes with the benefit of allowing you
+as a Beam contributor to focus on the Go SDK and ignore the underlying architecture specifics. The gradle commands
+conveniently automate and encapsulate these details.
+
+#### 1. Navigate to the Beam root directory
+
+Open a terminal and navigate into the root directory of the Beam repository. (See [Beam Repository](#beam-repository)
+for what `$BEAM_ROOT` means).
Review Comment:
```suggestion
Open a terminal and navigate into the root directory of the Beam repository.
```
Linking the section that explains `$BEAM_ROOT` feels unnecessary, since you've
already explained in plain text where the reader is navigating. You can arguably
drop the `cd` command block too.