KevinGG commented on code in PR #23271:
URL: https://github.com/apache/beam/pull/23271#discussion_r973417483
##########
sdks/python/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py:
##########
@@ -43,7 +45,33 @@ class UnimportedDataproc:
# Name of the log file auto-generated by Dataproc. We use it to locate the
# startup output of the Flink daemon to retrieve master url and dashboard
# information.
-DATAPROC_STAGING_LOG_NAME = 'dataproc-startup-script_output'
+DATAPROC_STAGING_LOG_NAME = 'dataproc-initialization-script-0_output'
+
+# Home dir of os user yarn.
+YARN_HOME = '/var/lib/hadoop-yarn'
+
+# Configures the os user yarn to use gcloud as the docker credHelper.
+# Also sets some taskmanager configurations for better parallelism.
+# Finally starts the yarn application: flink cluster in session mode.
+INIT_ACTION = """#!/bin/bash
+sudo -u yarn gcloud auth configure-docker --quiet
+
+readonly FLINK_INSTALL_DIR='/usr/lib/flink'
+readonly MASTER_HOSTNAME="$(/usr/share/google/get_metadata_value attributes/dataproc-master)"
+
+cat <<EOF >>${FLINK_INSTALL_DIR}/conf/flink-conf.yaml
Review Comment:
I don't need to set those values; I prefer to let Dataproc select
appropriate ones.
The values I do set here are basically:
- taskmanager.memory.network.*: for some reason Dataproc fixed all of these
to 64mb and did not honor the yarn defaults.
- taskmanager.network.numberOfBuffers: this should be >= `#slots-per-TM^2
* #TMs * 4`. I temporarily increased the default value to support parallelism
in the hundreds when the cluster has hundreds of CPUs. If needed, we can
expose it for end-user customization in the future.
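
For reference, the lower bound on the buffer count can be sketched in Python as below. This is only an illustration of the formula quoted above: the function name and the example cluster size are hypothetical and not part of this PR.

```python
def min_network_buffers(slots_per_tm: int, num_tms: int) -> int:
    """Return the lower bound #slots-per-TM^2 * #TMs * 4.

    Hypothetical helper for illustration; it is not defined in the PR.
    """
    if slots_per_tm < 1 or num_tms < 1:
        raise ValueError("slots_per_tm and num_tms must be positive")
    return slots_per_tm ** 2 * num_tms * 4


# Assumed example: 25 task managers with 8 slots each (200 slots in total).
print(min_network_buffers(8, 25))  # prints 6400
```

The configured taskmanager.network.numberOfBuffers would need to be at least this value for the assumed cluster shape.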