Re: [PR] [YAML] - Jinja % include example [beam]

via GitHub Thu, 21 Aug 2025 20:07:09 -0700


gemini-code-assist[bot] commented on code in PR #35914:
URL: https://github.com/apache/beam/pull/35914#discussion_r2292587296



##########
sdks/python/apache_beam/yaml/examples/testing/input_data.py:
##########
@@ -33,6 +33,36 @@ def text_data():
   ])
 
 
+def word_count_jinja_parameter_data():
+  return \
+    '{"readFromTextTransform": {"path": ' \
+    '"gs://dataflow-samples/shakespeare/kinglear.txt"}, ' \
+    '"mapToFieldsSplitConfig": {"language": "python", ' \
+    '"fields":{"value":"1"}}, ' \
+    '"explodeTransform":{"fields":"word"}, ' \
+    '"combineTransform":{"group_by":"word", "combine":{"value":"sum"}}, ' \
+    '"mapToFieldsCountConfig":{"language": "python", "fields":' \
+    '{"output": "word + \\" - \\" + str(value)"}}, ' \
+    '"writeToTextTransform":' \
+    '{"path":"gs://apache-beam-testing-derrickaw/wordCounts/"}}'

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   This long, concatenated string is hard to read and maintain. For better 
readability, you could construct a Python dictionary and then dump it to a JSON 
string using `json.dumps`. This would make the structure much clearer and 
easier to modify.
   
   Note: This change would require adding `import json` at the top of the file.
   
   ```python
   def word_count_jinja_parameter_data():
     params = {
         "readFromTextTransform": {
             "path": "gs://dataflow-samples/shakespeare/kinglear.txt"
         },
         "mapToFieldsSplitConfig": {
             "language": "python",
             "fields": {
                 "value": "1"
             }
         },
         "explodeTransform": {
             "fields": "word"
         },
         "combineTransform": {
             "group_by": "word",
             "combine": {
                 "value": "sum"
             }
         },
         "mapToFieldsCountConfig": {
             "language": "python",
             "fields": {
                 "output": "word + \" - \" + str(value)"
             }
         },
         "writeToTextTransform": {
             "path": "gs://apache-beam-testing-derrickaw/wordCounts/"
         }
     }
     return json.dumps(params)
   ```



##########
sdks/python/apache_beam/yaml/examples/transforms/jinja/include/submodules/combineTransform.yaml:
##########
@@ -0,0 +1,27 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#
+# The Combine transform groups by each word and count the occurrences.
+
+    - name: Count words
+      type: Combine
+      config:
+        group_by:
+          - {{combineTransform.group_by}}
+        combine:
+          value: {{combineTransform.combine.value}}

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   It's a common convention to have a newline at the end of a file. This file 
is missing one. Please consider adding it for consistency with other files in 
the project. This also applies to several other new YAML files in this pull 
request.
   
   ```yaml
             value: {{combineTransform.combine.value}}
   
   ```



##########
sdks/python/apache_beam/yaml/examples/transforms/jinja/include/wordCountInclude.yaml:
##########
@@ -0,0 +1,66 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This examples reads from a public file stored on Google Cloud. This
+# requires authenticating with Google Cloud, or setting the file in
+#`ReadFromText` to a local file.
+#
+# To set up Application Default Credentials,
+# see https://cloud.google.com/docs/authentication/external/set-up-adc.
+#
+# This pipeline reads in a text file, counts distinct words found in the text,
+# then logs a row containing each word and its count.
+
+pipeline:
+  type: chain
+  transforms:
+# Read in text file
+{% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/readFromTextTransform.yaml'
 %}
+
+# Split words and count occurrences
+    - name: Split words
+      type: MapToFields
+      config:
+{% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsSplitConfig.yaml'
 %}
+
+# Explode into individual words
+{% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/explodeTransform.yaml'
 %}
+
+# Group by word
+{% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/combineTransform.yaml'
 %}
+
+# Format output to a single string consisting of `word - count`
+    - name: Format output
+      type: MapToFields
+      config:
+{% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsCountConfig.yaml'
 %}
+
+# Write to text file on GCS, locally, etc
+{% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/writeToTextTransform.yaml'
 %}

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The use of `{% include %}` at the beginning of the line makes this YAML file 
syntactically invalid before Jinja processing, and it's hard to understand the 
structure.
   
   For better readability and maintainability, I suggest indenting the `{% 
include %}` tags to match the expected YAML structure and removing the 
corresponding indentation from the included submodule files. This makes both 
the main pipeline file and the included snippets more intuitive and closer to 
valid YAML.
   
   For example, you could change this file to:
   ```yaml
   ...
     transforms:
       # Read in text file
       {% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/readFromTextTransform.yaml'
 %}
   
       # Split words and count occurrences
       - name: Split words
         type: MapToFields
         config:
           {% include 
'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsSplitConfig.yaml'
 %}
   ...
   ```
   And then adjust the included files accordingly. For example 
`readFromTextTransform.yaml` would become:
   ```yaml
   - name: Read from GCS
     type: ReadFromText
     config:
       path: {{readFromTextTransform.path}}
   ```
   (with its leading indentation removed). This is a significant refactoring, 
but it would greatly improve the clarity of the example.



##########
sdks/python/apache_beam/yaml/examples/transforms/jinja/include/README.md:
##########
@@ -0,0 +1,63 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+-->
+
+## Jinja % include Pipeline
+
+This example leverages the `% include` Jinja directive by having one main
+pipeline and then submodules for each transformed used.
+
+General setup:
+```sh
+export 
PIPELINE_FILE=apache_beam/yaml/examples/transforms/jinja/include/wordCountInclude.yaml
+export KINGLEAR="gs://dataflow-samples/shakespeare/kinglear.txt"
+export TEMP_LOCATION="gs://MY-BUCKET/wordCounts/"
+
+cd <PATH_TO_BEAM_REPO>/beam/sdks/python
+```
+
+Multiline Run Example:
+```sh
+python -m apache_beam.yaml.main \
+  --yaml_pipeline_file="${PIPELINE_FILE}" \
+  --jinja_variables='{
+    "readFromTextTransform": {"path": "'"${KINGLEAR}"'"},
+    "mapToFieldsSplitConfig": {
+      "language": "python",
+      "fields": {
+        "value": "1"
+      }
+    },
+    "explodeTransform": {"fields": "word"},
+    "combineTransform": {
+      "group_by": "word",
+      "combine": {"value": "sum"}
+    },
+    "mapToFieldsCountConfig": {
+      "language": "python",
+      "fields": {"output": "word + \" - \" + str(value)"}
+    },
+    "writeToTextTransform": {"path": "'"${TEMP_LOCATION}"'"}
+  }'
+```
+
+Single Line Run Example:
+```sh
+python -m apache_beam.yaml.main --yaml_pipeline_file="${PIPELINE_FILE}" 
--jinja_variables='{"readFromTextTransform": {"path": 
"gs://dataflow-samples/shakespeare/kinglear.txt"}, "mapToFieldsSplitConfig": 
{"language": "python", "fields":{"value":"1"}}, 
"explodeTransform":{"fields":"word"}, "combineTransform":{"group_by":"word", 
"combine":{"value":"sum"}}, "mapToFieldsCountConfig":{"language": "python", 
"fields":{"output":"word + \" - \" + str(value)"}}, 
"writeToTextTransform":{"path":"${TEMP_LOCATION}"}}'

Review Comment:
   ![high](https://www.gstatic.com/codereviewagent/high-priority.svg)
   
   In the single line run example, the `${TEMP_LOCATION}` variable is inside 
single quotes, which prevents shell expansion. This means it will be passed as 
a literal string instead of its value, causing the command to fail. To fix 
this, you should break out of the single quotes to allow for variable 
expansion, similar to how it's done in the multiline example.
   
   ```suggestion
   python -m apache_beam.yaml.main --yaml_pipeline_file="${PIPELINE_FILE}" 
--jinja_variables='{"readFromTextTransform": {"path": 
"gs://dataflow-samples/shakespeare/kinglear.txt"}, "mapToFieldsSplitConfig": 
{"language": "python", "fields":{"value":"1"}}, 
"explodeTransform":{"fields":"word"}, "combineTransform":{"group_by":"word", 
"combine":{"value":"sum"}}, "mapToFieldsCountConfig": {"language": "python", 
"fields":{"output":"word + \" - \" + str(value)"}}, 
"writeToTextTransform":{"path":"'"${TEMP_LOCATION}"'"}}'
   ```



##########
sdks/python/apache_beam/yaml/examples/testing/input_data.py:
##########
@@ -33,6 +33,36 @@
   ])
 
 
+def word_count_jinja_parameter_data():
+  return \
+    '{"readFromTextTransform": {"path": ' \
+    '"gs://dataflow-samples/shakespeare/kinglear.txt"}, ' \
+    '"mapToFieldsSplitConfig": {"language": "python", ' \
+    '"fields":{"value":"1"}}, ' \
+    '"explodeTransform":{"fields":"word"}, ' \
+    '"combineTransform":{"group_by":"word", "combine":{"value":"sum"}}, ' \
+    '"mapToFieldsCountConfig":{"language": "python", "fields":' \
+    '{"output": "word + \\" - \\" + str(value)"}}, ' \
+    '"writeToTextTransform":' \
+    '{"path":"gs://apache-beam-testing-derrickaw/wordCounts/"}}'
+
+
+def word_count_jinja_template_data():
+  return \
+[('apache_beam/yaml/examples/transforms/jinja/'
+    'include/submodules/readFromTextTransform.yaml'),
+   ('apache_beam/yaml/examples/transforms/jinja/'
+   'include/submodules/mapToFieldsSplitConfig.yaml'),
+   ('apache_beam/yaml/examples/transforms/jinja/'
+   'include/submodules/explodeTransform.yaml'),
+   ('apache_beam/yaml/examples/transforms/jinja/'
+   'include/submodules/combineTransform.yaml'),
+   ('apache_beam/yaml/examples/transforms/jinja/'
+   'include/submodules/mapToFieldsCountConfig.yaml'),
+   ('apache_beam/yaml/examples/transforms/jinja/'
+   'include/submodules/writeToTextTransform.yaml')]

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The backslash for line continuation is unnecessary here, as Python 
automatically handles lists that span multiple lines. Removing it and 
formatting as a standard list would make the code cleaner and more idiomatic.
   
   ```python
   def word_count_jinja_template_data():
     return [
         
'apache_beam/yaml/examples/transforms/jinja/include/submodules/readFromTextTransform.yaml',
         
'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsSplitConfig.yaml',
         
'apache_beam/yaml/examples/transforms/jinja/include/submodules/explodeTransform.yaml',
         
'apache_beam/yaml/examples/transforms/jinja/include/submodules/combineTransform.yaml',
         
'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsCountConfig.yaml',
         
'apache_beam/yaml/examples/transforms/jinja/include/submodules/writeToTextTransform.yaml'
     ]
   ```



##########
sdks/python/apache_beam/yaml/examples/testing/examples_test.py:
##########
@@ -339,8 +343,21 @@ def test_yaml_example(self):
     for i, line in enumerate(expected):
       expected[i] = line.replace('#  ', '').replace('\n', '')
     expected = [line for line in expected if line]
+
+    raw_spec_string = ''.join(lines)
+    # Filter for any jinja preprocessor - this has to be done before other
+    # preprocessors.
+    jinja_preprocessor = [
+        preprocessor for preprocessor in custom_preprocessors
+        if 'jinja_preprocessor' in preprocessor.__name__
+    ]
+    if jinja_preprocessor:
+      jinja_preprocessor = jinja_preprocessor[0]
+      raw_spec_string = jinja_preprocessor(raw_spec_string)
+      custom_preprocessors.remove(jinja_preprocessor)

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   This logic assumes that at most one `jinja_preprocessor` is found. If 
multiple preprocessors match the condition, it will silently use the first one, 
which might hide issues or lead to unexpected behavior. It would be more robust 
to ensure that only one such preprocessor exists by adding an assertion.
   
   ```suggestion
       if jinja_preprocessor:
         assert len(jinja_preprocessor) == 1, "Expected at most one jinja 
preprocessor."
         jinja_preprocessor = jinja_preprocessor[0]
         raw_spec_string = jinja_preprocessor(raw_spec_string)
         custom_preprocessors.remove(jinja_preprocessor)
   ```



##########
sdks/python/apache_beam/yaml/examples/testing/examples_test.py:
##########
@@ -538,8 +558,64 @@
     word = element.split('=')[1].split(',')[0].replace("'", '')
     count = int(element.split('=')[2].replace(')', ''))
     all_words += [word] * count
-  random.shuffle(all_words)
 
+  return _wordcount_random_shuffler(test_spec, all_words, env)
+
+
+@YamlExamplesTestSuite.register_test_preprocessor(
+    ['test_wordCountInclude_yaml'])
+def _wordcount_test_preprocessor(

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The function `_wordcount_test_preprocessor` was recently renamed to 
`_wordcount_minimal_test_preprocessor` for clarity. Introducing a new function 
with the old name can be confusing. Consider renaming this function to be more 
specific, for example, `_wordcount_jinja_test_preprocessor`, to reflect its 
purpose and avoid ambiguity.
   
   ```suggestion
   def _wordcount_jinja_test_preprocessor(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [YAML] - Jinja % include example [beam]

Reply via email to