gemini-code-assist[bot] commented on code in PR #35914: URL: https://github.com/apache/beam/pull/35914#discussion_r2292587296
########## sdks/python/apache_beam/yaml/examples/testing/input_data.py: ########## @@ -33,6 +33,36 @@ def text_data(): ]) +def word_count_jinja_parameter_data(): + return \ + '{"readFromTextTransform": {"path": ' \ + '"gs://dataflow-samples/shakespeare/kinglear.txt"}, ' \ + '"mapToFieldsSplitConfig": {"language": "python", ' \ + '"fields":{"value":"1"}}, ' \ + '"explodeTransform":{"fields":"word"}, ' \ + '"combineTransform":{"group_by":"word", "combine":{"value":"sum"}}, ' \ + '"mapToFieldsCountConfig":{"language": "python", "fields":' \ + '{"output": "word + \\" - \\" + str(value)"}}, ' \ + '"writeToTextTransform":' \ + '{"path":"gs://apache-beam-testing-derrickaw/wordCounts/"}}' Review Comment:  This long, concatenated string is hard to read and maintain. For better readability, you could construct a Python dictionary and then dump it to a JSON string using `json.dumps`. This would make the structure much clearer and easier to modify. Note: This change would require adding `import json` at the top of the file. ```python def word_count_jinja_parameter_data(): params = { "readFromTextTransform": { "path": "gs://dataflow-samples/shakespeare/kinglear.txt" }, "mapToFieldsSplitConfig": { "language": "python", "fields": { "value": "1" } }, "explodeTransform": { "fields": "word" }, "combineTransform": { "group_by": "word", "combine": { "value": "sum" } }, "mapToFieldsCountConfig": { "language": "python", "fields": { "output": "word + \" - \" + str(value)" } }, "writeToTextTransform": { "path": "gs://apache-beam-testing-derrickaw/wordCounts/" } } return json.dumps(params) ``` ########## sdks/python/apache_beam/yaml/examples/transforms/jinja/include/submodules/combineTransform.yaml: ########## @@ -0,0 +1,27 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# +# The Combine transform groups by each word and count the occurrences. + + - name: Count words + type: Combine + config: + group_by: + - {{combineTransform.group_by}} + combine: + value: {{combineTransform.combine.value}} Review Comment:  It's a common convention to have a newline at the end of a file. This file is missing one. Please consider adding it for consistency with other files in the project. This also applies to several other new YAML files in this pull request. ```yaml value: {{combineTransform.combine.value}} ``` ########## sdks/python/apache_beam/yaml/examples/transforms/jinja/include/wordCountInclude.yaml: ########## @@ -0,0 +1,66 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This examples reads from a public file stored on Google Cloud. This +# requires authenticating with Google Cloud, or setting the file in +#`ReadFromText` to a local file. +# +# To set up Application Default Credentials, +# see https://cloud.google.com/docs/authentication/external/set-up-adc. +# +# This pipeline reads in a text file, counts distinct words found in the text, +# then logs a row containing each word and its count. + +pipeline: + type: chain + transforms: +# Read in text file +{% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/readFromTextTransform.yaml' %} + +# Split words and count occurrences + - name: Split words + type: MapToFields + config: +{% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsSplitConfig.yaml' %} + +# Explode into individual words +{% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/explodeTransform.yaml' %} + +# Group by word +{% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/combineTransform.yaml' %} + +# Format output to a single string consisting of `word - count` + - name: Format output + type: MapToFields + config: +{% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsCountConfig.yaml' %} + +# Write to text file on GCS, locally, etc +{% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/writeToTextTransform.yaml' %} Review Comment:  The use of `{% include %}` at the beginning of the line makes this YAML file syntactically invalid before Jinja processing, and it's hard to understand the structure. For better readability and maintainability, I suggest indenting the `{% include %}` tags to match the expected YAML structure and removing the corresponding indentation from the included submodule files. This makes both the main pipeline file and the included snippets more intuitive and closer to valid YAML. For example, you could change this file to: ```yaml ... transforms: # Read in text file {% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/readFromTextTransform.yaml' %} # Split words and count occurrences - name: Split words type: MapToFields config: {% include 'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsSplitConfig.yaml' %} ... ``` And then adjust the included files accordingly. For example `readFromTextTransform.yaml` would become: ```yaml - name: Read from GCS type: ReadFromText config: path: {{readFromTextTransform.path}} ``` (with its leading indentation removed). This is a significant refactoring, but it would greatly improve the clarity of the example. ########## sdks/python/apache_beam/yaml/examples/transforms/jinja/include/README.md: ########## @@ -0,0 +1,63 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +## Jinja % include Pipeline + +This example leverages the `% include` Jinja directive by having one main +pipeline and then submodules for each transformed used. + +General setup: +```sh +export PIPELINE_FILE=apache_beam/yaml/examples/transforms/jinja/include/wordCountInclude.yaml +export KINGLEAR="gs://dataflow-samples/shakespeare/kinglear.txt" +export TEMP_LOCATION="gs://MY-BUCKET/wordCounts/" + +cd <PATH_TO_BEAM_REPO>/beam/sdks/python +``` + +Multiline Run Example: +```sh +python -m apache_beam.yaml.main \ + --yaml_pipeline_file="${PIPELINE_FILE}" \ + --jinja_variables='{ + "readFromTextTransform": {"path": "'"${KINGLEAR}"'"}, + "mapToFieldsSplitConfig": { + "language": "python", + "fields": { + "value": "1" + } + }, + "explodeTransform": {"fields": "word"}, + "combineTransform": { + "group_by": "word", + "combine": {"value": "sum"} + }, + "mapToFieldsCountConfig": { + "language": "python", + "fields": {"output": "word + \" - \" + str(value)"} + }, + "writeToTextTransform": {"path": "'"${TEMP_LOCATION}"'"} + }' +``` + +Single Line Run Example: +```sh +python -m apache_beam.yaml.main --yaml_pipeline_file="${PIPELINE_FILE}" --jinja_variables='{"readFromTextTransform": {"path": "gs://dataflow-samples/shakespeare/kinglear.txt"}, "mapToFieldsSplitConfig": {"language": "python", "fields":{"value":"1"}}, "explodeTransform":{"fields":"word"}, "combineTransform":{"group_by":"word", "combine":{"value":"sum"}}, "mapToFieldsCountConfig":{"language": "python", "fields":{"output":"word + \" - \" + str(value)"}}, "writeToTextTransform":{"path":"${TEMP_LOCATION}"}}' Review Comment:  In the single line run example, the `${TEMP_LOCATION}` variable is inside single quotes, which prevents shell expansion. This means it will be passed as a literal string instead of its value, causing the command to fail. To fix this, you should break out of the single quotes to allow for variable expansion, similar to how it's done in the multiline example. ```suggestion python -m apache_beam.yaml.main --yaml_pipeline_file="${PIPELINE_FILE}" --jinja_variables='{"readFromTextTransform": {"path": "gs://dataflow-samples/shakespeare/kinglear.txt"}, "mapToFieldsSplitConfig": {"language": "python", "fields":{"value":"1"}}, "explodeTransform":{"fields":"word"}, "combineTransform":{"group_by":"word", "combine":{"value":"sum"}}, "mapToFieldsCountConfig": {"language": "python", "fields":{"output":"word + \" - \" + str(value)"}}, "writeToTextTransform":{"path":"'"${TEMP_LOCATION}"'"}}' ``` ########## sdks/python/apache_beam/yaml/examples/testing/input_data.py: ########## @@ -33,6 +33,36 @@ ]) +def word_count_jinja_parameter_data(): + return \ + '{"readFromTextTransform": {"path": ' \ + '"gs://dataflow-samples/shakespeare/kinglear.txt"}, ' \ + '"mapToFieldsSplitConfig": {"language": "python", ' \ + '"fields":{"value":"1"}}, ' \ + '"explodeTransform":{"fields":"word"}, ' \ + '"combineTransform":{"group_by":"word", "combine":{"value":"sum"}}, ' \ + '"mapToFieldsCountConfig":{"language": "python", "fields":' \ + '{"output": "word + \\" - \\" + str(value)"}}, ' \ + '"writeToTextTransform":' \ + '{"path":"gs://apache-beam-testing-derrickaw/wordCounts/"}}' + + +def word_count_jinja_template_data(): + return \ +[('apache_beam/yaml/examples/transforms/jinja/' + 'include/submodules/readFromTextTransform.yaml'), + ('apache_beam/yaml/examples/transforms/jinja/' + 'include/submodules/mapToFieldsSplitConfig.yaml'), + ('apache_beam/yaml/examples/transforms/jinja/' + 'include/submodules/explodeTransform.yaml'), + ('apache_beam/yaml/examples/transforms/jinja/' + 'include/submodules/combineTransform.yaml'), + ('apache_beam/yaml/examples/transforms/jinja/' + 'include/submodules/mapToFieldsCountConfig.yaml'), + ('apache_beam/yaml/examples/transforms/jinja/' + 'include/submodules/writeToTextTransform.yaml')] Review Comment:  The backslash for line continuation is unnecessary here, as Python automatically handles lists that span multiple lines. Removing it and formatting as a standard list would make the code cleaner and more idiomatic. ```python def word_count_jinja_template_data(): return [ 'apache_beam/yaml/examples/transforms/jinja/include/submodules/readFromTextTransform.yaml', 'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsSplitConfig.yaml', 'apache_beam/yaml/examples/transforms/jinja/include/submodules/explodeTransform.yaml', 'apache_beam/yaml/examples/transforms/jinja/include/submodules/combineTransform.yaml', 'apache_beam/yaml/examples/transforms/jinja/include/submodules/mapToFieldsCountConfig.yaml', 'apache_beam/yaml/examples/transforms/jinja/include/submodules/writeToTextTransform.yaml' ] ``` ########## sdks/python/apache_beam/yaml/examples/testing/examples_test.py: ########## @@ -339,8 +343,21 @@ def test_yaml_example(self): for i, line in enumerate(expected): expected[i] = line.replace('# ', '').replace('\n', '') expected = [line for line in expected if line] + + raw_spec_string = ''.join(lines) + # Filter for any jinja preprocessor - this has to be done before other + # preprocessors. + jinja_preprocessor = [ + preprocessor for preprocessor in custom_preprocessors + if 'jinja_preprocessor' in preprocessor.__name__ + ] + if jinja_preprocessor: + jinja_preprocessor = jinja_preprocessor[0] + raw_spec_string = jinja_preprocessor(raw_spec_string) + custom_preprocessors.remove(jinja_preprocessor) Review Comment:  This logic assumes that at most one `jinja_preprocessor` is found. If multiple preprocessors match the condition, it will silently use the first one, which might hide issues or lead to unexpected behavior. It would be more robust to ensure that only one such preprocessor exists by adding an assertion. ```suggestion if jinja_preprocessor: assert len(jinja_preprocessor) == 1, "Expected at most one jinja preprocessor." jinja_preprocessor = jinja_preprocessor[0] raw_spec_string = jinja_preprocessor(raw_spec_string) custom_preprocessors.remove(jinja_preprocessor) ``` ########## sdks/python/apache_beam/yaml/examples/testing/examples_test.py: ########## @@ -538,8 +558,64 @@ word = element.split('=')[1].split(',')[0].replace("'", '') count = int(element.split('=')[2].replace(')', '')) all_words += [word] * count - random.shuffle(all_words) + return _wordcount_random_shuffler(test_spec, all_words, env) + + +@YamlExamplesTestSuite.register_test_preprocessor( + ['test_wordCountInclude_yaml']) +def _wordcount_test_preprocessor( Review Comment:  The function `_wordcount_test_preprocessor` was recently renamed to `_wordcount_minimal_test_preprocessor` for clarity. Introducing a new function with the old name can be confusing. Consider renaming this function to be more specific, for example, `_wordcount_jinja_test_preprocessor`, to reflect its purpose and avoid ambiguity. ```suggestion def _wordcount_jinja_test_preprocessor( ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org