This is an automated email from the ASF dual-hosted git repository.
yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git
The following commit(s) were added to refs/heads/master by this push:
new d02ff60 [SPARK-40513][DOCS] Add apache/spark docker image overview
d02ff60 is described below
commit d02ff6091835311a32c7ccc73d8ebae1d5817ecc
Author: Yikun Jiang <[email protected]>
AuthorDate: Tue Jun 27 14:28:21 2023 +0800
[SPARK-40513][DOCS] Add apache/spark docker image overview
### What changes were proposed in this pull request?
This PR adds `OVERVIEW.md`.
### Why are the changes needed?
This will be used on the https://hub.docker.com/r/apache/spark page to
introduce the Spark docker image and tag info.
### Does this PR introduce _any_ user-facing change?
Yes, doc only
### How was this patch tested?
Doc only, review.
Closes #34 from Yikun/overview.
Authored-by: Yikun Jiang <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
---
OVERVIEW.md | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)
diff --git a/OVERVIEW.md b/OVERVIEW.md
new file mode 100644
index 0000000..0465555
--- /dev/null
+++ b/OVERVIEW.md
@@ -0,0 +1,83 @@
+# What is Apache Spark™?
+
+Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
+
+https://spark.apache.org/
+
+## Online Documentation
+
+You can find the latest Spark documentation, including a programming guide, on the [project web page](https://spark.apache.org/documentation.html). This overview only contains basic setup instructions.
+
+## Interactive Scala Shell
+
+The easiest way to start using Spark is through the Scala shell:
+
+```
+docker run -it apache/spark /opt/spark/bin/spark-shell
+```
+
+Try the following command, which should return 1,000,000,000:
+
+```
+scala> spark.range(1000 * 1000 * 1000).count()
+```
+
+## Interactive Python Shell
+
+The easiest way to start using PySpark is through the Python shell:
+
+```
+docker run -it apache/spark /opt/spark/bin/pyspark
+```
+
+Then run the following command, which should also return 1,000,000,000:
+
+```
+>>> spark.range(1000 * 1000 * 1000).count()
+```
+
+## Interactive R Shell
+
+The easiest way to start using R on Spark is through the R shell:
+
+```
+docker run -it apache/spark:r /opt/spark/bin/sparkR
+```
+
+## Running Spark on Kubernetes
+
+https://spark.apache.org/docs/latest/running-on-kubernetes.html
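+
+As an illustrative sketch rather than a definitive recipe (the API server address and the example jar path below are placeholders), an application can be submitted to a Kubernetes cluster with this image as the container image; see the guide above for the full set of options:
+
+```
+# Replace <k8s-apiserver-host>:<k8s-apiserver-port> with your cluster's API server;
+# the example jar path assumes the Spark examples shipped in the image.
+./bin/spark-submit \
+    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+    --deploy-mode cluster \
+    --name spark-pi \
+    --class org.apache.spark.examples.SparkPi \
+    --conf spark.executor.instances=2 \
+    --conf spark.kubernetes.container.image=apache/spark:3.4.0 \
+    local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar
+```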
+
+## Supported tags and respective Dockerfile links
+
+Currently, the `apache/spark` docker image provides four image types for each version; a pull example follows the list below.
+
+For example, for v3.4.0:
+- [3.4.0-scala2.12-java11-python3-ubuntu, 3.4.0-python3, 3.4.0, python3, latest](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-ubuntu)
+- [3.4.0-scala2.12-java11-r-ubuntu, 3.4.0-r, r](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-r-ubuntu)
+- [3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, scala](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-ubuntu)
+- [3.4.0-scala2.12-java11-python3-r-ubuntu](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-r-ubuntu)
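+
+For instance, a variant can be pulled by any of its tags from the list above, e.g.:
+
+```
+# Pull the default (Python) image, or a specific variant by tag
+docker pull apache/spark:latest
+docker pull apache/spark:3.4.0-r
+```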
+
+## Environment Variables
+
+The environment variables accepted by `entrypoint.sh` are listed below:
+
+| Environment Variable | Meaning |
+|----------------------|-----------|
+| SPARK_EXTRA_CLASSPATH | Extra entries to add to the classpath; see also https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management |
+| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver and workers (default is `python3` if available, otherwise `python`). The property `spark.pyspark.python` takes precedence if it is set |
+| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in the driver only (default is `PYSPARK_PYTHON`). The property `spark.pyspark.driver.python` takes precedence if it is set |
+| SPARK_DIST_CLASSPATH | Distribution-defined classpath to add to processes |
+| SPARK_DRIVER_BIND_ADDRESS | Hostname or IP address on which to bind listening sockets. See also `spark.driver.bindAddress` |
+| SPARK_EXECUTOR_JAVA_OPTS | Java options for the Spark executor |
+| SPARK_APPLICATION_ID | A unique identifier for the Spark application |
+| SPARK_EXECUTOR_POD_IP | The pod IP address of the Spark executor |
+| SPARK_RESOURCE_PROFILE_ID | The resource profile ID |
+| SPARK_EXECUTOR_POD_NAME | The executor pod name |
+| SPARK_CONF_DIR | Alternate conf dir. (Default: ${SPARK_HOME}/conf) |
+| SPARK_EXECUTOR_CORES | Number of cores for the executors (Default: 1) |
+| SPARK_EXECUTOR_MEMORY | Memory per Executor (e.g. 1000M, 2G) (Default: 1G) |
+| SPARK_DRIVER_MEMORY | Memory for Driver (e.g. 1000M, 2G) (Default: 1G) |
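+
+As an illustrative sketch (the values below are examples, not recommendations), some of these variables can be overridden when starting a shell from the image:
+
+```
+# Example values only; see the table above for each variable's meaning
+docker run -it \
+    -e SPARK_DRIVER_MEMORY=2G \
+    -e PYSPARK_PYTHON=python3 \
+    apache/spark /opt/spark/bin/pyspark
+```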
+
+See also https://spark.apache.org/docs/latest/configuration.html and https://spark.apache.org/docs/latest/running-on-kubernetes.html
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]