I have been thinking about elasticity and autoscaling for Spark in the context of Kubernetes. My experience with Spark on Kubernetes in so-called autopilot mode has not been that great. In autopilot you leave the choice of nodes to the vendor's default configuration, and autopilot assumes it can scale horizontally when resources are not available. However, this does not take into account the case where the cluster starts with, say, a 4GB k8s node, which is totally inadequate for a Spark job with a moderate load. The driver pod simply fails to be created and autopilot starts rebuilding the cluster, causing delay. Sure, it can start with a larger node size and it might get there eventually, but at a considerable delay.
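To make that failure mode concrete, here is a minimal sketch (my own illustration, not Autopilot's actual scheduler logic; the allocatable figure is an assumed example): Spark adds a non-heap overhead of max(10%, 384 MiB) to the driver container's memory request, and the pod only schedules if that request fits within a node's allocatable memory, which sits well below the node's nominal capacity.

```python
# Sketch: why a small node cannot host a moderate Spark driver pod.
# The 2700 MiB allocatable figure for a "4GB node" is illustrative only
# (capacity minus kubelet/system reservations), not a real GKE value.

def pod_memory_request_mib(spark_memory_mib: int, overhead_factor: float = 0.1) -> int:
    """Spark adds max(overhead_factor * memory, 384 MiB) of non-heap overhead
    to the container memory request (cf. spark.kubernetes.memoryOverheadFactor)."""
    overhead = max(int(spark_memory_mib * overhead_factor), 384)
    return spark_memory_mib + overhead

def schedulable(spark_memory_mib: int, node_allocatable_mib: int) -> bool:
    """A pod only schedules if its memory request fits the node's allocatable memory."""
    return pod_memory_request_mib(spark_memory_mib) <= node_allocatable_mib

print(schedulable(4096, 2700))  # 4g driver on a 4GB node -> False, pod stays Pending
print(schedulable(1024, 2700))  # 1g driver fits -> True
```

One way around this is to steer the driver onto adequately sized nodes up front, e.g. with `spark.kubernetes.node.selector.[labelKey]`, rather than waiting for the autoscaler to discover the mismatch.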
Vertical elasticity refers to the ability of a single application instance to scale its resources up or down, by adjusting the amount of memory, CPU or storage allocated to it. Horizontal autoscaling refers to the ability to automatically add or remove application instances based on the workload, typically driven by performance metrics such as CPU utilization, memory usage or request latency.

Vertical elasticity
- Memory: the amount of memory allocated to each Spark executor.
- CPU: the number of CPU cores allocated to each Spark executor.
- Storage: the amount of storage allocated to each Spark executor.

Horizontal autoscaling
- Minimum number of executors: the minimum number of executors that should be running at any given time.
- Maximum number of executors: the maximum number of executors that can be running at any given time.
- Target CPU utilization: the desired CPU utilization for the cluster.
- Target memory utilization: the desired memory utilization for the cluster.
- Target request latency: the desired request latency for the application.

For example, in Python I would have these:

# Setting the horizontal autoscaling parameters
spark.conf.set('spark.dynamicAllocation.enabled', 'true')
spark.conf.set('spark.dynamicAllocation.minExecutors', min_instances)
spark.conf.set('spark.dynamicAllocation.maxExecutors', max_instances)
spark.conf.set('spark.dynamicAllocation.executorIdleTimeout', '30s')
spark.conf.set('spark.dynamicAllocation.initialExecutors', 4)

I have also been setting the following keys, which are not strictly necessary for horizontal autoscaling:

- target_memory_utilization: the desired memory utilization for the application cluster.
- target_request_latency: the desired request latency for the application cluster.

Note, though, that these are not recognised Spark properties: spark.conf.set accepts arbitrary keys, but Spark ignores ones it does not know, so utilization and latency targets of this kind have to be enforced by the Kubernetes autoscaler (e.g. an HPA) rather than by Spark itself.
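On the horizontal side, Spark's dynamic allocation ramps the executor target up exponentially while a task backlog persists (governed by spark.dynamicAllocation.schedulerBacklogTimeout and its sustained variant). A simplified sketch of that ramp-up policy, assuming the number requested doubles each backlog round (the real logic in Spark's ExecutorAllocationManager also bounds the target by the number of pending tasks):

```python
# Simplified sketch of Spark dynamic-allocation ramp-up: each sustained
# backlog round requests twice as many executors as the previous round,
# capped at spark.dynamicAllocation.maxExecutors.

def ramp_up(min_executors: int, max_executors: int, rounds: int) -> list[int]:
    """Return the total executor target after each backlog round."""
    targets = []
    total = min_executors
    to_add = 1  # first round requests a single executor
    for _ in range(rounds):
        total = min(total + to_add, max_executors)
        targets.append(total)
        to_add *= 2  # doubled while the backlog is sustained
        if total == max_executors:
            break
    return targets

print(ramp_up(min_executors=0, max_executors=20, rounds=6))  # [1, 3, 7, 15, 20]
```

This is why a cold start from minExecutors=0 takes several backlog rounds to reach a large cluster, and why initialExecutors is worth setting for jobs with known parallelism.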
spark.conf.set('target_request_latency', 100)
spark.conf.set('target_memory_utilization', 60)

Anyway, this is a sample of the parameters that I use in spark-submit:

spark-submit --verbose \
  --properties-file ${property_file} \
  --master k8s://https://$KUBERNETES_MASTER_IP:443 \
  --deploy-mode cluster \
  --name $APPNAME \
  --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
  --conf spark.kubernetes.namespace=$NAMESPACE \
  --conf spark.network.timeout=300 \
  --conf spark.kubernetes.allocation.batch.size=3 \
  --conf spark.kubernetes.allocation.batch.delay=1 \
  --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.driver.pod.name=$APPNAME \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
  --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
  --conf spark.dynamicAllocation.executorIdleTimeout=30s \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.driver.cores=3 \
  --conf spark.executor.cores=3 \
  --conf spark.driver.memory=1024m \
  --conf spark.executor.memory=1024m \
  $CODE_DIRECTORY_CLOUD/${APPLICATION}

Note that I have kept the memory low (for both the driver and executor) to move the submitted job from Pending to Running state. This is by no means optimum, but I would like to explore ideas on it with the other members.

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk.
Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.