GitHub user chenlica created a discussion: Evaluation of Spark (from old wiki)

> From the page https://github.com/apache/texera/wiki/Evaluation-of-Spark (may be dangling)

====
Author: Yuran Yan

# Overview
This page gives a summary of the Apache Spark basics that have been discussed in 
this [GitHub Issue](https://github.com/Texera/texera/issues/633).

## Spark Cluster Architecture

There are 3 major components of a typical Spark cluster:

**Driver Program**

This is the machine that the user uses to initiate the entire program; it is also 
known as the entry point of the application.

**Cluster Manager**

This is the machine that manages the cluster. It keeps track of the states of 
the worker machines and tells the Driver which machines it can connect to and 
send tasks to. This machine is also known as the "Master node" of the cluster.


**Worker Node**

These are the machines that actually do the computation for all Spark 
applications. Each worker node has to connect to the Cluster Manager to become 
part of the Spark cluster. Each worker node will receive different tasks over 
the course of the application. The Worker Node is also known as the 
"Slave node" of the cluster.

<img width="488" alt="image" 
src="https://user-images.githubusercontent.com/19600235/42184891-e84a109a-7dfb-11e8-96f3-e47ada710362.png";>

There are also some important smaller components within each major component:

**SparkContext**

This is the Spark object that holds the Spark application; it is also where the 
code of a Spark program begins. Each Spark application will have 
**exactly one** SparkContext object.
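
A minimal sketch (not from the original wiki; the application name and master URL are placeholder assumptions) of how that single SparkContext is typically created:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("texera-spark-evaluation")   // hypothetical application name
      .setMaster("local[*]")                   // run locally with all cores; replace with the cluster URL
    val sc = new SparkContext(conf)            // exactly one SparkContext per application

    // ... build RDDs and submit jobs through sc here ...

    sc.stop()                                  // release the context when the application finishes
  }
}
```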

**Executors**

Executors are the actual abstraction units that handle Spark computation. The 
Worker node, which is a machine, merely hosts the executor(s). A Worker node 
can have one or more executors, each of which takes control of part of the 
Worker node's cores and memory. Tasks are sent to executors for execution. Each 
executor controls some CPU cores and some memory of its worker node.
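
A hedged sketch of how executor resources can be requested through standard Spark configuration properties (`spark.executor.cores`, `spark.executor.memory`); the concrete values and the standalone master URL are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorConfigDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("executor-config-demo")   // hypothetical application name
      .setMaster("spark://master:7077")     // placeholder standalone-cluster URL
      .set("spark.executor.cores", "2")     // CPU cores each executor takes from its worker
      .set("spark.executor.memory", "4g")   // memory each executor takes from its worker
    val sc = new SparkContext(conf)
    // Jobs submitted through sc are split into tasks and shipped to these executors.
    sc.stop()
  }
}
```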

**Tasks**

A task is the smallest unit of work in a Spark program. For now, you can regard 
tasks as fragments of code or commands that the executors need to perform. The 
details of this concept are discussed in the "Physical plan" part.

For more references about the architecture of Spark, read these links:

- http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
- https://spark.apache.org/docs/latest/cluster-overview.html

## Spark Programming Concept

Spark has its own way of processing data. All data inputs have to be 
transformed into the data abstractions used by Spark. There are two of them.

**Resilient Distributed dataset(RDD)**

RDD is the core and basic data abstraction used by Apache Spark. It is a 
collection of elements partitioned across the nodes of the cluster that can be 
operated on in parallel (a short sketch follows the list below).
Three basic features of an RDD:
1. Resilient: able to recompute missing or damaged partitions after failures, 
with the help of the RDD lineage graph.
2. Distributed: the data is distributed across multiple nodes.
3. Dataset: it allows different kinds of value types.
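
A minimal sketch (assuming a local master and made-up data) of building a partitioned RDD that can be processed in parallel:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Distribute a local collection into an RDD with 4 partitions.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)

    println(numbers.getNumPartitions)   // 4: one partition per slice
    println(numbers.count())            // 1000: an Action that triggers the computation

    sc.stop()
  }
}
```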

**Dataset(Dataframe)**

This newer basic data abstraction was introduced around Spark 2.0. According to 
the Spark official site, it is strongly typed like an RDD, but it has richer 
optimizations. However, its actual performance compared to RDDs still needs to 
be explored. Most properties that apply to RDDs also apply to Datasets.
https://spark.apache.org/docs/latest/quick-start.html
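
A small sketch of the Dataset API (the case class, values, and local master are assumptions); note that the entry point here is a SparkSession rather than a raw SparkContext:

```scala
import org.apache.spark.sql.SparkSession

object DatasetDemo {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-demo")   // hypothetical application name
      .master("local[*]")        // placeholder master
      .getOrCreate()
    import spark.implicits._

    // A strongly typed Dataset, analogous to an RDD but planned by Spark SQL's optimizer.
    val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
    people.filter(_.age > 26).show()   // same style of operations as on an RDD

    spark.stop()
  }
}
```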

**RDD operation and its laziness**

To prevent the system from doing unnecessary computation, Spark evaluates all 
RDD operations lazily and only performs the computation when the result is 
explicitly requested by the Driver program. To understand what this means, you 
first have to understand the categories of RDD operations.

First of all, all RDDs are immutable, which means no RDD can be changed; you 
can only build new RDDs based on old ones. The data of an old RDD remains until 
it is erased by the garbage collector, unless it has been cached.
There are two types of RDD operations: Transformations and Actions.

A Transformation ONLY generates new RDDs from other RDDs. Both the input and 
the output are RDDs.

An Action ONLY generates output whose type is not an RDD. For all Spark 
programs, data collections that are not RDDs are sent back to the Driver 
program. This is where the Driver program explicitly asks for the computation 
results.

Following this definition, operations like `map()` or `filter()` are considered 
Transformations.
Operations like `collect()` or `count()` are considered Actions.
The operation `getNumPartitions()` does not belong to either of them.
Spark will never perform any Transformation until it is asked to perform an 
Action. So in the example sketched below, if the last line with `collect()` is 
removed, no `map()` or `filter()` will be executed when you run the program.
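
A minimal sketch of this laziness (the data and names are illustrative assumptions; it can be pasted into `spark-shell`, which predefines `sc`):

```scala
val words  = sc.parallelize(Seq("spark", "texera", "rdd", "dataset"))
val upper  = words.map(_.toUpperCase)       // Transformation: only recorded, nothing runs yet
val short  = upper.filter(_.length <= 5)    // Transformation: still nothing runs
val result = short.collect()                // Action: the whole chain executes now
println(result.mkString(", "))              // SPARK, RDD
// Remove the collect() line and neither map() nor filter() ever executes.
```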

Be wary: if a Spark program has multiple Actions that refer to RDDs computed 
before the previous Actions, those RDDs will be recomputed unless they have 
been `persist()`-ed or `cache()`-d.
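
A short sketch of caching (again assuming `spark-shell`'s predefined `sc`; the data is made up) so that a second Action reuses the RDD instead of recomputing it:

```scala
val squares = sc.parallelize(1 to 10000)
  .map(x => x.toLong * x)   // imagine this Transformation being expensive
  .cache()                  // keep the RDD in memory after it is first computed

println(squares.count())    // first Action: computes the RDD and caches it
println(squares.sum())      // second Action: served from the cache, no recomputation
```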

GitHub link: https://github.com/apache/texera/discussions/3966
