GitHub user chenlica created a discussion: Evaluation of Spark (from old wiki)

> From the page https://github.com/apache/texera/wiki/Evaluation-of-Spark (may be dangling)

====
Author: Yuran Yan

# Overview
This page gives a summary of the Apache Spark basics that have been discussed in 
this [GitHub Issue](https://github.com/Texera/texera/issues/633).

## Spark Cluster Architecture

There are 3 major components of a typical Spark cluster:

**Driver Program**

This is the machine that the user uses to initiate the entire program; it is also 
known as the entry point of the application.

**Cluster Manager**

This is the machine that manages the cluster. It keeps track of the states of 
the worker machines and tells the Driver which machines it can connect to and 
send tasks to. This machine is also known as the "Master node" of the cluster.


**Worker Node**

These are the machines that actually do the computation for all Spark 
applications. Each worker node has to connect to the Cluster Manager to become 
part of the Spark cluster. Each worker node will receive different tasks over 
the course of the application. The Worker Node is also known as the 
"Slave node" of the cluster.

<img width="488" alt="image" 
src="https://user-images.githubusercontent.com/19600235/42184891-e84a109a-7dfb-11e8-96f3-e47ada710362.png";>

There are also some important smaller components within each major component:

**SparkContext**

This is the Spark object that holds the Spark application; it is also where the 
code of a Spark program begins. Each Spark application will have 
**exactly one** SparkContext object.
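
A minimal sketch (not from the original wiki; the application name and master URL are placeholder assumptions) of how that single SparkContext is typically created:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("texera-spark-evaluation")   // hypothetical application name
      .setMaster("local[*]")                   // run locally with all cores; replace with the cluster URL
    val sc = new SparkContext(conf)            // exactly one SparkContext per application

    // ... build RDDs and submit jobs through sc here ...

    sc.stop()                                  // release the context when the application finishes
  }
}
```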

**Executors**

Executors are the actual abstraction units that handle Spark computation. The 
Worker node, which is a machine, merely hosts the executor(s). A Worker node 
can have one or more executors, each of which takes control of part of the 
Worker node's cores and memory. Tasks are sent to executors for execution. Each 
executor controls some CPU cores and some memory of its worker node.
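
A hedged sketch of how executor resources can be requested through standard Spark configuration properties (`spark.executor.cores`, `spark.executor.memory`); the concrete values and the standalone master URL are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorConfigDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("executor-config-demo")   // hypothetical application name
      .setMaster("spark://master:7077")     // placeholder standalone-cluster URL
      .set("spark.executor.cores", "2")     // CPU cores each executor takes from its worker
      .set("spark.executor.memory", "4g")   // memory each executor takes from its worker
    val sc = new SparkContext(conf)
    // Jobs submitted through sc are split into tasks and shipped to these executors.
    sc.stop()
  }
}
```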

**Tasks**

A task is the smallest unit of work in a Spark program. For now, you can regard 
tasks as fragments of code or commands that the executors need to perform. The 
details of this concept are discussed in the "Physical plan" part.

For more references about the architecture of Spark, read these links:

- http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
- https://spark.apache.org/docs/latest/cluster-overview.html

## Spark Programming Concept

Spark has its own way of processing data. All data inputs have to be 
transformed into the data abstractions used by Spark. There are two of them.

**Resilient Distributed dataset(RDD)**

RDD is the core and basic data abstraction used by Apache Spark. It is a 
collection of elements partitioned across the nodes of the cluster that can be 
operated on in parallel (a short sketch follows the list below).
Three basic features of an RDD:
1. Resilient: able to recompute missing or damaged partitions after failures, 
with the help of the RDD lineage graph.
2. Distributed: the data is distributed across multiple nodes.
3. Dataset: it allows different kinds of value types.
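
A minimal sketch (assuming a local master and made-up data) of building a partitioned RDD that can be processed in parallel:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Distribute a local collection into an RDD with 4 partitions.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)

    println(numbers.getNumPartitions)   // 4: one partition per slice
    println(numbers.count())            // 1000: an Action that triggers the computation

    sc.stop()
  }
}
```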

**Dataset(Dataframe)**

This newer basic data abstraction was introduced around Spark 2.0. According to 
the Spark official site, it is strongly typed like an RDD, but it has richer 
optimizations. However, its actual performance compared to RDDs still needs to 
be explored. Most properties that apply to RDDs also apply to Datasets.
https://spark.apache.org/docs/latest/quick-start.html
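
A small sketch of the Dataset API (the case class, values, and local master are assumptions); note that the entry point here is a SparkSession rather than a raw SparkContext:

```scala
import org.apache.spark.sql.SparkSession

object DatasetDemo {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-demo")   // hypothetical application name
      .master("local[*]")        // placeholder master
      .getOrCreate()
    import spark.implicits._

    // A strongly typed Dataset, analogous to an RDD but planned by Spark SQL's optimizer.
    val people = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
    people.filter(_.age > 26).show()   // same style of operations as on an RDD

    spark.stop()
  }
}
```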

**RDD operation and its laziness**

To prevent the system from doing unnecessary computation, Spark evaluates all 
RDD operations lazily and only performs the computation when the result is 
explicitly requested by the Driver program. To understand what this means, you 
first have to understand the categories of RDD operations.

First of all, all RDDs are immutable, which means no RDD can be changed; you 
can only build new RDDs based on old ones. The data of an old RDD remains until 
it is erased by the garbage collector, unless it has been cached.
There are two types of RDD operations: Transformations and Actions.

A Transformation ONLY generates new RDDs from other RDDs. Both the input and 
the output are RDDs.

An Action ONLY generates output whose type is not an RDD. For all Spark 
programs, data collections that are not RDDs are sent back to the Driver 
program. This is where the Driver program explicitly asks for the computation 
results.

Following this definition, operations like `map()` or `filter()` are considered 
Transformations.
Operations like `collect()` or `count()` are considered Actions.
The operation `getNumPartitions()` does not belong to either of them.
Spark will never perform any Transformation until it is asked to perform an 
Action. So in the example sketched below, if the last line with `collect()` is 
removed, no `map()` or `filter()` will be executed when you run the program.
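
A minimal sketch of this laziness (the data and names are illustrative assumptions; it can be pasted into `spark-shell`, which predefines `sc`):

```scala
val words  = sc.parallelize(Seq("spark", "texera", "rdd", "dataset"))
val upper  = words.map(_.toUpperCase)       // Transformation: only recorded, nothing runs yet
val short  = upper.filter(_.length <= 5)    // Transformation: still nothing runs
val result = short.collect()                // Action: the whole chain executes now
println(result.mkString(", "))              // SPARK, RDD
// Remove the collect() line and neither map() nor filter() ever executes.
```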

Be wary: if a Spark program has multiple Actions that refer to RDDs computed 
before the previous Actions, those RDDs will be recomputed unless they have 
been `persist()`-ed or `cache()`-d.
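
A short sketch of caching (again assuming `spark-shell`'s predefined `sc`; the data is made up) so that a second Action reuses the RDD instead of recomputing it:

```scala
val squares = sc.parallelize(1 to 10000)
  .map(x => x.toLong * x)   // imagine this Transformation being expensive
  .cache()                  // keep the RDD in memory after it is first computed

println(squares.count())    // first Action: computes the RDD and caches it
println(squares.sum())      // second Action: served from the cache, no recomputation
```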

GitHub link: https://github.com/apache/texera/discussions/3966
