kamalmemon opened a new issue, #29037:
URL: https://github.com/apache/beam/issues/29037
### What happened?
SDK Version with issue: 2.51
SDK Version without issue: 2.50
I've encountered an issue with the extra_package option in PipelineOptions
when submitting a job using the DataflowRunner. The dependencies specified in
the extra_package option are not being installed on the Dataflow workers. This
behavior is specific to SDK version 2.51. The same code works as expected when
using SDK version 2.50.
**Details**
Example code:
```python
job_options = {
'job_name': "<REDACTED>",
'project': "<REDACTED>",
'runner': "DataflowRunner",
'temp_location': "<REDACTED>",
'region': "us-central1",
'subnetwork': "<REDACTED>",
'extra_package': 'tensorflow==2.13',
}
def run(output_path, query, project, feature_dtype_map, options):
opts = PipelineOptions(**options)
google_cloud_options = opts.view_as(GoogleCloudOptions)
google_cloud_options.labels = ['<REDACTED>']
with beam.Pipeline(options=opts) as p:
(
p | 'ReadFromBigQuery' >>
beam.io.Read(ReadFromBigQuery(query=query, use_standard_sql=True))
| 'ConvertToTFExample' >> beam.Map(<REDACTED>)
| 'WriteToTFRecord' >> <REDACTED>
)
```
**Observed Behavior**
When running with SDK version 2.51, the Dataflow workers fail with a
ModuleNotFoundError for tensorflow, suggesting that the tensorflow==2.13
package was not installed on the workers. No such error is observed with SDK
version 2.50.
**Expected Behavior**
The tensorflow==2.13 package should be installed on the Dataflow workers,
and the job should proceed without errors related to missing modules.
Would appreciate any assistance or workaround for this issue. Thank you!
### Issue Priority
Priority: 1 (data loss / total loss of function)
### Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [X] Component: Google Cloud Dataflow Runner
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]