I have been thinking about elasticity and autoscaling for Spark in the context of Kubernetes. My experience with Spark on Kubernetes in so-called autopilot mode has not been that great. In autopilot you leave the choice of nodes to the vendor's default configuration, and autopilot assumes it can scale horizontally when resources are not available. However, this does not take into account the case where the cluster starts with, say, a 4GB k8s node, which is totally inadequate for a Spark job with a moderate load. The driver pod simply fails to be created and autopilot starts rebuilding the cluster, causing delay. Sure, it can start with a larger node size and it might get there eventually, but at a considerable delay.
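To make that failure mode concrete, here is a minimal sketch (my own illustration, not Autopilot's actual scheduler logic; the allocatable figure is an assumed example): Spark adds a non-heap overhead of max(10%, 384 MiB) to the driver container's memory request, and the pod only schedules if that request fits within a node's allocatable memory, which sits well below the node's nominal capacity.

```python
# Sketch: why a small node cannot host a moderate Spark driver pod.
# The 2700 MiB allocatable figure for a "4GB node" is illustrative only
# (capacity minus kubelet/system reservations), not a real GKE value.

def pod_memory_request_mib(spark_memory_mib: int, overhead_factor: float = 0.1) -> int:
    """Spark adds max(overhead_factor * memory, 384 MiB) of non-heap overhead
    to the container memory request (cf. spark.kubernetes.memoryOverheadFactor)."""
    overhead = max(int(spark_memory_mib * overhead_factor), 384)
    return spark_memory_mib + overhead

def schedulable(spark_memory_mib: int, node_allocatable_mib: int) -> bool:
    """A pod only schedules if its memory request fits the node's allocatable memory."""
    return pod_memory_request_mib(spark_memory_mib) <= node_allocatable_mib

print(schedulable(4096, 2700))  # 4g driver on a 4GB node -> False, pod stays Pending
print(schedulable(1024, 2700))  # 1g driver fits -> True
```

One way around this is to steer the driver onto adequately sized nodes up front, e.g. with `spark.kubernetes.node.selector.[labelKey]`, rather than waiting for the autoscaler to discover the mismatch.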
Vertical elasticity refers to the ability of a single application instance to scale its resources up or down, by adjusting the amount of memory, CPU or storage allocated to it. Horizontal autoscaling refers to the ability to automatically add or remove application instances based on the workload, typically driven by performance metrics such as CPU utilization, memory usage or request latency.

Vertical elasticity
- Memory: the amount of memory allocated to each Spark executor.
- CPU: the number of CPU cores allocated to each Spark executor.
- Storage: the amount of storage allocated to each Spark executor.

Horizontal autoscaling
- Minimum number of executors: the minimum number of executors that should be running at any given time.
- Maximum number of executors: the maximum number of executors that can be running at any given time.
- Target CPU utilization: the desired CPU utilization for the cluster.
- Target memory utilization: the desired memory utilization for the cluster.
- Target request latency: the desired request latency for the application.

For example, in Python I would have these:

# Setting the horizontal autoscaling parameters
spark.conf.set('spark.dynamicAllocation.enabled', 'true')
spark.conf.set('spark.dynamicAllocation.minExecutors', min_instances)
spark.conf.set('spark.dynamicAllocation.maxExecutors', max_instances)
spark.conf.set('spark.dynamicAllocation.executorIdleTimeout', '30s')
spark.conf.set('spark.dynamicAllocation.initialExecutors', 4)

I have also been setting the following keys, which are not strictly necessary for horizontal autoscaling:

- target_memory_utilization: the desired memory utilization for the application cluster.
- target_request_latency: the desired request latency for the application cluster.

Note, though, that these are not recognised Spark properties: spark.conf.set accepts arbitrary keys, but Spark ignores ones it does not know, so utilization and latency targets of this kind have to be enforced by the Kubernetes autoscaler (e.g. an HPA) rather than by Spark itself.
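On the horizontal side, Spark's dynamic allocation ramps the executor target up exponentially while a task backlog persists (governed by spark.dynamicAllocation.schedulerBacklogTimeout and its sustained variant). A simplified sketch of that ramp-up policy, assuming the number requested doubles each backlog round (the real logic in Spark's ExecutorAllocationManager also bounds the target by the number of pending tasks):

```python
# Simplified sketch of Spark dynamic-allocation ramp-up: each sustained
# backlog round requests twice as many executors as the previous round,
# capped at spark.dynamicAllocation.maxExecutors.

def ramp_up(min_executors: int, max_executors: int, rounds: int) -> list[int]:
    """Return the total executor target after each backlog round."""
    targets = []
    total = min_executors
    to_add = 1  # first round requests a single executor
    for _ in range(rounds):
        total = min(total + to_add, max_executors)
        targets.append(total)
        to_add *= 2  # doubled while the backlog is sustained
        if total == max_executors:
            break
    return targets

print(ramp_up(min_executors=0, max_executors=20, rounds=6))  # [1, 3, 7, 15, 20]
```

This is why a cold start from minExecutors=0 takes several backlog rounds to reach a large cluster, and why initialExecutors is worth setting for jobs with known parallelism.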
spark.conf.set('target_request_latency', 100)
spark.conf.set('target_memory_utilization', 60)

Anyway, this is a sample of the parameters that I use in spark-submit:

spark-submit --verbose \
  --properties-file ${property_file} \
  --master k8s://https://$KUBERNETES_MASTER_IP:443 \
  --deploy-mode cluster \
  --name $APPNAME \
  --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
  --conf spark.kubernetes.namespace=$NAMESPACE \
  --conf spark.network.timeout=300 \
  --conf spark.kubernetes.allocation.batch.size=3 \
  --conf spark.kubernetes.allocation.batch.delay=1 \
  --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.driver.pod.name=$APPNAME \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
  --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
  --conf spark.dynamicAllocation.executorIdleTimeout=30s \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.driver.cores=3 \
  --conf spark.executor.cores=3 \
  --conf spark.driver.memory=1024m \
  --conf spark.executor.memory=1024m \
  $CODE_DIRECTORY_CLOUD/${APPLICATION}

Note that I have kept the memory low (for both the driver and executor) to move the submitted job from Pending to Running state. This is by no means optimum, but I would like to explore ideas on it with the other members.

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk.
Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.