damccorm commented on a change in pull request #17189: URL: https://github.com/apache/beam/pull/17189#discussion_r835806315
##########
File path: sdks/python/apache_beam/runners/dataflow/dataflow_job_service.py
##########

```diff
@@ -0,0 +1,100 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+import logging
+import sys
+import time
+import traceback
+
+from apache_beam.options import pipeline_options
+from apache_beam.portability.api import beam_artifact_api_pb2_grpc
+from apache_beam.portability.api import beam_runner_api_pb2
+from apache_beam.portability.api import beam_fn_api_pb2_grpc
+from apache_beam.portability.api import beam_job_api_pb2
+from apache_beam.portability.api import beam_job_api_pb2_grpc
+from apache_beam.portability.api import beam_provision_api_pb2
+from apache_beam.portability.api import endpoints_pb2
+from apache_beam.runners.dataflow import dataflow_runner
+from apache_beam.runners.job import utils as job_utils
+from apache_beam.runners.portability import abstract_job_service
+from apache_beam.runners.portability import local_job_service
+from apache_beam.runners.portability import local_job_service_main
+from apache_beam.runners.portability import portable_runner
+from apache_beam.runners.portability.fn_api_runner import fn_runner
+
+_LOGGER = logging.getLogger(__name__)
+
+
+class DataflowBeamJob(local_job_service.BeamJob):
```

Review comment:
It would probably be good to document this class, and maybe `_invoke_runner` specifically as well (even though it's private), since anyone who is interested in inserting some proxy-specific logic _probably_ wants to do so here.

##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########

```diff
@@ -1210,7 +1210,7 @@ def _add_argparse_args(cls, parser):
             'and port, e.g. localhost:8098. If none is specified, the '
             'artifact endpoint sent from the job server is used.'))
     parser.add_argument(
-        '--job-server-timeout',
+        '--job_server_timeout',
```

Review comment:
I think this is the right change, but I just wanted to call out that it is sort of breaking: anyone who is currently setting `--job-server-timeout` will now have their pipeline throw an invalid-argument error. With that said, that's probably better than the current behavior of silently not setting the timeout 😅
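The two behaviors described above (the old flag being silently ignored, the strict parse rejecting it) can be seen with a minimal `argparse` sketch. The flag names are taken from the diff; the lenient-parsing behavior assumes the options are parsed with `parse_known_args`, which is consistent with the "silently not setting the timeout" description:

```python
import argparse

parser = argparse.ArgumentParser()
# After the rename, only the underscore spelling is registered.
parser.add_argument('--job_server_timeout', type=int, default=60)

# A user still passing the old hyphenated flag:
args, unknown = parser.parse_known_args(['--job-server-timeout', '30'])

# argparse matches option strings exactly as registered, so the old flag
# is not matched: the default survives and the flag (plus its value)
# lands in the "unknown" bucket, i.e. the timeout is silently ignored.
assert args.job_server_timeout == 60
assert unknown == ['--job-server-timeout', '30']
```

A strict `parser.parse_args(['--job-server-timeout', '30'])` call would instead exit with the invalid-argument error the review comment mentions.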
##########
File path: sdks/python/apache_beam/runners/dataflow/dataflow_job_service.py
##########

(Diff hunk abbreviated; it repeats the new-file header shown in the first comment. The comment is anchored on this line:)

```diff
+_LOGGER = logging.getLogger(__name__)
```

Review comment:
Nit: Do we actually use this? To be even nittier, the same question applies to some of the imports (specifically `fn_runner`, `abstract_job_service`, and all the generated proto/grpc imports).

##########
File path: sdks/python/apache_beam/runners/dataflow/dataflow_job_service.py
##########
(Diff hunk abbreviated; it repeats the new-file header shown in the first comment. The comment is anchored on this line:)

```diff
+class DataflowBeamJob(local_job_service.BeamJob):
```

Review comment:
In fact, I also wonder if we can make the experience even a little easier for that class of customer by pulling more of `run` out into its own class.
If we defined `run` as:

```python
class DataflowJobService:
  def run(self, argv, beam_job_type):
    # Same as the current implementation, but with beam_job_type
    # replacing DataflowBeamJob.
    ...
```

then their implementation could be as simple as:

```python
class ExtraSecureDataflowBeamJob(dataflow_job_service.DataflowBeamJob):
  def _invoke_runner(self):
    # Proxy-specific logic would go here, then delegate.
    super()._invoke_runner()


if __name__ == '__main__':
  logging.basicConfig()
  logging.getLogger().setLevel(logging.INFO)
  job_service = dataflow_job_service.DataflowJobService()
  job_service.run(sys.argv, ExtraSecureDataflowBeamJob)
```

and they could easily take Beam upgrades without worrying about anything getting stomped on. Thoughts?

##########
File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner.py
##########

```diff
@@ -555,13 +566,15 @@ def run_pipeline(self, pipeline, options):

     self.job = apiclient.Job(options, self.proto_pipeline)

-    # Dataflow Runner v1 requires output type of the Flatten to be the same as
-    # the inputs, hence we enforce that here. Dataflow Runner v2 does not
-    # require this.
-    pipeline.visit(self.flatten_input_visitor())
+    # TODO: Consider skipping these for all use_portable_job_submission jobs.
```

Review comment:

> TODO: Consider skipping these for all use_portable_job_submission jobs.

Isn't that effectively what we're doing here? Are there existing use cases for Dataflow + `use_portable_job_submission`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
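Editor's note: the two suggestions in this thread (document `DataflowBeamJob`/`_invoke_runner`, and make `_invoke_runner` the single override point for proxy-specific subclasses) can be combined into a self-contained sketch. `BeamJob` here is a hypothetical stand-in for `local_job_service.BeamJob`, and the `'submitted'` return value is purely illustrative; the real classes live in the Beam modules discussed above:

```python
class BeamJob:
    """Stand-in for local_job_service.BeamJob (assumption: the real class
    exposes _invoke_runner as the hook that executes the pipeline)."""

    def _invoke_runner(self):
        raise NotImplementedError


class DataflowBeamJob(BeamJob):
    """A BeamJob that submits the pipeline to the Dataflow service
    instead of executing it locally."""

    def _invoke_runner(self):
        """Runs this job on Dataflow.

        Subclasses that need proxy-specific behavior (credentials,
        custom endpoints, extra logging) should override this method
        and delegate back via super()._invoke_runner().
        """
        return 'submitted'  # illustrative; the real method runs the job


class ExtraSecureDataflowBeamJob(DataflowBeamJob):
    def _invoke_runner(self):
        # Proxy-specific logic would go here, then delegate. Note the
        # delegation is super()._invoke_runner() with no explicit self.
        return super()._invoke_runner()
```

With this shape, a custom job service only touches `_invoke_runner`, so Beam upgrades that change `run` or the surrounding plumbing do not stomp on the subclass.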
