jonathaningram opened a new issue, #34343:
URL: https://github.com/apache/beam/issues/34343

   ### What happened?
   
   Beam version: at least v2.63.0.
   
   The `--yaml_pipeline` flag takes the pipeline definition itself as an inline 
string. The `--yaml_pipeline_file` flag takes a path to a file containing the 
pipeline.
   
   We can successfully use the `--yaml_pipeline_file` flag locally to run our 
YAML pipeline. As soon as we switch to `--yaml_pipeline`, it fails with a 
`TypeError` (see the stack trace below). We tried both the `--yaml-pipeline` 
and `--yaml-pipeline-file` flags of `gcloud dataflow yaml run`, and both seem 
to have the same issue.
   
   **Note: We haven't been able to run any YAML pipeline with a Java provider 
successfully in Dataflow, so we're interested in a patch being applied to 
Dataflow. If there's a workaround in the meantime, that would be great.**
   
   <details>
   <summary>Stack trace</summary>
   
   ```
   <snip>
   INFO:apache_beam.yaml.yaml_transform:Expanding "Create" at line 4
   INFO:apache_beam.yaml.yaml_transform:Expanding "Identity" at line 18
   Traceback (most recent call last):
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 371, in create_ptransform
       ptransform = provider.create_transform(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 192, in create_transform
       self._service = self._service()
                       ^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 328, in <lambda>
       urns, lambda: external.JavaJarExpansionService(jar_provider()))
                                                      ^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 260, in <lambda>
       urns, lambda: _join_url_or_filepath(provider_base_path, jar))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_provider.py", line 1282, in _join_url_or_filepath
       path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/urllib/parse.py", line 395, in urlparse
       splitresult = urlsplit(url, scheme, allow_fragments)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/urllib/parse.py", line 478, in urlsplit
       scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   TypeError: a bytes-like object is required, not 'str'
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "<frozen runpy>", line 198, in _run_module_as_main
     File "<frozen runpy>", line 88, in _run_code
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/main.py", line 154, in <module>
       run()
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/main.py", line 143, in run
       yaml_transform.expand_pipeline(
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 1077, in expand_pipeline
       providers or {})).expand(beam.pvalue.PBegin(pipeline))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 1042, in expand
       result = expand_transform(
                ^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 442, in expand_transform
       return expand_composite_transform(spec, scope)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 520, in expand_composite_transform
       return CompositePTransform.expand(None)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 508, in expand
       inner_scope.compute_all()
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 196, in compute_all
       self.compute_outputs(transform_id)
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 97, in wrapper
       self._cache[key] = func(self, *args)
                          ^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 232, in compute_outputs
       return expand_transform(self._transforms_by_uuid[transform_id], self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 444, in expand_transform
       return expand_leaf_transform(spec, scope)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 466, in expand_leaf_transform
       ptransform = scope.create_ptransform(spec, inputs_dict.values())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/apache_beam/yaml/yaml_transform.py", line 413, in create_ptransform
       raise ValueError(
   ValueError: Invalid transform specification at "Identity" at line 18: a bytes-like object is required, not 'str'
   Building pipeline...
   ```
   
   </details>
   
   I've made a repro here: 
https://github.com/jonathaningram/beam-starter-java-provider-repro which 
contains much of the same info as I've put in this ticket.
   
   The issue seems to be an encoding one.
   
   Here is a possible patch that works locally. I haven't verified how suitable 
the fix is, so I haven't proposed a PR.
   
   Inside the `beam` repo:
   
   ```
   ➜  beam git:(v2.63.0) ✗ gb
   * (HEAD detached at sdks/v2.63.0)
     master
   ➜  beam git:(v2.63.0) ✗ gd
   diff --git a/sdks/python/apache_beam/yaml/yaml_provider.py b/sdks/python/apache_beam/yaml/yaml_provider.py
   index aa3c5d90515..f9d1bcf914c 100755
   --- a/sdks/python/apache_beam/yaml/yaml_provider.py
   +++ b/sdks/python/apache_beam/yaml/yaml_provider.py
   @@ -1279,7 +1279,7 @@ def _as_list(func):
   
    def _join_url_or_filepath(base, path):
      base_scheme = urllib.parse.urlparse(base, '').scheme
   -  path_scheme = urllib.parse.urlparse(path, base_scheme).scheme
   +  path_scheme = urllib.parse.urlparse(path.encode(), base_scheme).scheme
      if path_scheme != base_scheme:
        return path
      elif base_scheme and base_scheme in urllib.parse.uses_relative:
   ```
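
The patch above encodes `path`, presumably so that both arguments to `urlparse` end up as bytes. Another possible direction, sketched here under assumptions and not verified against Beam, is to normalize everything to `str` before parsing. On the Python 3.11 shown in the trace, `urllib.parse` returns bytes components for bytes input and raises `TypeError` when a bytes `scheme` default is combined with a `str` URL. That would match the trace if `base` is not a `str` when the pipeline is passed inline (for example `None`, which is a guess and not confirmed in this report). `scheme_of` is a hypothetical helper, not part of Beam.

```python
import urllib.parse


def scheme_of(value, default=""):
    """Return the URL scheme of `value` as a str.

    Hypothetical helper (not a Beam API). Assumption: `value` or `default`
    may arrive as None or bytes, so both are coerced to str before parsing
    to keep urllib.parse from mixing str and bytes arguments.
    """
    if value is None:
        value = ""
    if isinstance(value, bytes):
        value = value.decode()
    if isinstance(default, bytes):
        default = default.decode()
    return urllib.parse.urlparse(value, default).scheme


# A missing base yields an empty str scheme, which is safe to pass on
# as the default when parsing the jar path.
base_scheme = scheme_of(None)
path_scheme = scheme_of("provider.jar", base_scheme)
print(repr(base_scheme), repr(path_scheme))
```

With a helper like this, the `path_scheme != base_scheme` comparison in `_join_url_or_filepath` would always compare `str` to `str`. Whether that is preferable to encoding `path` depends on what the rest of the function expects.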
   
   You can mount the `beam` source code in the container in my repro and 
observe that it now works:
   
   ```
   docker run -v "$(pwd):/app" \
       -v "$BEAM_PYTHON_SRC:/usr/local/lib/python3.11/site-packages/apache_beam/yaml" \
       -v ~/.config/gcloud:/root/.config/gcloud \
       -w /app \
       --entrypoint /bin/bash beam_python3.11_sdk_with_java:2.63.0 \
       -c "python -m apache_beam.yaml.main --yaml_pipeline='$(yq -o=json '.' "$PIPELINE_FILE")' --runner=DataflowRunner"
   ```
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

