The equivalent of Google GKE Autopilot
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>
in AWS is AWS Fargate <https://aws.amazon.com/fargate/>. I have not used
AWS Fargate, so I can only comment on Google's GKE Autopilot.
Both grew out of the concepts of containerization and microservices.
When you create a GKE cluster in standard mode, you can customize the
configuration to your requirements: GKE manages the control plane, while
you provision and manage the node infrastructure yourself. You choose the
machine type and the memory/CPU for the nodes where your Spark containers
will run, and those nodes show up as VM hosts in your account. In GKE
Autopilot mode, GKE manages the nodes as well and pre-configures the
cluster with add-ons for auto-scaling, auto-upgrades, maintenance, Day 2
operations and security hardening. So there is a lot there. You don't
choose your nodes or their sizes; you are effectively paying for the pods
you use.
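To illustrate the difference at cluster creation time, here is a rough
sketch (the cluster names, zone/region and machine type below are made up):

# Standard mode: you pick the machine type and the number of nodes
gcloud container clusters create spark-on-gke \
    --zone europe-west2-a \
    --machine-type e2-standard-4 \
    --num-nodes 3

# Autopilot mode: no machine type or node count to choose; GKE
# provisions nodes to fit the pods you request
gcloud container clusters create-auto spark-on-gke-autopilot \
    --region europe-west2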
With spark-submit you still need to specify the number of executors, the
driver and executor memory, and the cores for each driver and executor.
The theory is that the k8s cluster will deploy suitable nodes and create
enough pods on those nodes. With a standard k8s cluster you choose your
nodes yourself, and you need to ensure that one core on each node is left
for the OS itself. Otherwise, if you allocate all cores to Spark with
--conf spark.executor.cores, you will see this error:
kubectl describe pods -n spark
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  9s (x17 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
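For reference, the settings in question look something like the sketch
below (the API server address, container image and application file are
placeholders; on a 4-vCPU node, spark.executor.cores=3 leaves a core free
for the OS and kubelet and avoids the error above):

spark-submit \
    --master k8s://https://<k8s-api-server>:443 \
    --deploy-mode cluster \
    --name sparkbq \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    --conf spark.executor.instances=6 \
    --conf spark.driver.memory=2g \
    --conf spark.executor.memory=4g \
    --conf spark.driver.cores=1 \
    --conf spark.executor.cores=3 \
    local:///opt/spark/work-dir/<your_app>.py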
So with standard k8s you have a choice in sizing your cores. With
Autopilot, node selection is left to Autopilot to deploy suitable nodes,
and getting the configuration right will be trial and error at the start.
You may be lucky if the execution history is kept current and the same job
is repeated. However, in my experience, getting the driver pod into a
"running" state is expensive time-wise, and without an executor in a
running state there is no chance of the Spark job doing anything:
NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s

NAME                                         READY   STATUS              RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s

NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s
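The snapshots above are just repeated runs of kubectl get pods; to see why
an executor is stuck in Pending, you can watch the namespace and read the
pod's events, e.g.:

kubectl get pods -n spark -w
kubectl describe pod randomdatabigquery-cebab77eea6de971-exec-1 -n spark

The Events section at the bottom of the describe output gives the
scheduler's reason, as in the "Insufficient cpu" message earlier.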
Basically, I told Spark to use 6 executors but could only bring one
executor into a running state, after the driver pod had been spinning for
4 minutes:
22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079
22/02/11 20:16:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block manager sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB RAM, BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation without a shuffle service is an experimental feature.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
22/02/11 20:16:49 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
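The ExecutorAllocationManager warning above appears because dynamic
allocation was enabled without an external shuffle service, which does not
exist on k8s; shuffle tracking is what makes dynamic allocation workable
there. The relevant confs look something like this (values are
illustrative, matching the 6 executors above):

--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=6 \
--conf spark.executor.instances=6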
OK, there is a lot to digest here, and I would appreciate feedback from
other members who have experimented with GKE Autopilot or AWS Fargate, or
who are familiar with k8s.
Thanks