This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.4 by this push:
new cc662875cee [SPARK-42797][CONNECT][DOCS] Grammatical improvements for
Spark Connect content
cc662875cee is described below
commit cc662875cee3cccc67c3ee2a30f0d44d5b618ac8
Author: Allan Folting <[email protected]>
AuthorDate: Wed Mar 15 12:33:45 2023 +0900
[SPARK-42797][CONNECT][DOCS] Grammatical improvements for Spark Connect
content
### What changes were proposed in this pull request?
Grammatical improvements to the Spark Connect content as a follow-up on
https://github.com/apache/spark/pull/40324/
### Why are the changes needed?
To improve readability of the pages.
### Does this PR introduce _any_ user-facing change?
Yes, user-facing documentation is updated.
### How was this patch tested?
Built the doc website locally and checked the updates.
PRODUCTION=1 SKIP_RDOC=1 bundle exec jekyll build
Closes #40428 from allanf-db/connect_overview_doc.
Authored-by: Allan Folting <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 88d5c752829722b0b42f2c91fd57fb3e8fa17339)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
docs/index.md | 14 +++++++-------
docs/spark-connect-overview.md | 28 ++++++++++++++--------------
2 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/docs/index.md b/docs/index.md
index 4f24ad4edce..37b1311c306 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -21,7 +21,7 @@ license: |
---
Apache Spark is a unified analytics engine for large-scale data processing.
-It provides high-level APIs in Java, Scala, Python and R,
+It provides high-level APIs in Java, Scala, Python, and R,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including [Spark
SQL](sql-programming-guide.html) for SQL and structured data processing,
[pandas API on Spark](api/python/getting_started/quickstart_ps.html) for pandas
workloads, [MLlib](ml-guide.html) for machine learning,
[GraphX](graphx-programming-guide.html) for graph processing, and [Structured
Streaming](structured-streaming-programming-guide.html) for incremental
computation and stream processing.
@@ -39,17 +39,17 @@ source, visit [Building Spark](building-spark.html).
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it
should run on any platform that runs a supported version of Java. This should
include JVMs on x86_64 and ARM64. It's easy to run locally on one machine ---
all you need is to have `java` installed on your system `PATH`, or the
`JAVA_HOME` environment variable pointing to a Java installation.
-Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
+Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+, and R 3.5+.
Python 3.7 support is deprecated as of Spark 3.4.0.
Java 8 prior to version 8u362 support is deprecated as of Spark 3.4.0.
When using the Scala API, it is necessary for applications to use the same
version of Scala that Spark was compiled for.
For example, when using Scala 2.13, use Spark compiled for 2.13, and compile
code/applications for Scala 2.13 as well.
-For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required
additionally for Apache Arrow library. This prevents
`java.lang.UnsupportedOperationException: sun.misc.Unsafe or
java.nio.DirectByteBuffer.(long, int) not available` when Apache Arrow uses
Netty internally.
+For Java 11, setting `-Dio.netty.tryReflectionSetAccessible=true` is required
for the Apache Arrow library. This prevents the
`java.lang.UnsupportedOperationException: sun.misc.Unsafe or
java.nio.DirectByteBuffer.(long, int) not available` error when Apache Arrow
uses Netty internally.
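As an illustration of one way to pass this flag (a sketch only; the application JAR
below is a placeholder, and `--driver-java-options` plus
`spark.executor.extraJavaOptions` are the standard Spark settings for extra JVM
options):

./bin/spark-submit \
  --driver-java-options "-Dio.netty.tryReflectionSetAccessible=true" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  path/to/your-app.jar   # placeholder for your application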
# Running the Examples and Shell
-Spark comes with several sample programs. Python, Scala, Java and R examples
are in the
+Spark comes with several sample programs. Python, Scala, Java, and R examples
are in the
`examples/src/main` directory.
To run Spark interactively in a Python interpreter, use
@@ -77,14 +77,14 @@ great way to learn the framework.
The `--master` option specifies the
[master URL for a distributed
cluster](submitting-applications.html#master-urls), or `local` to run
locally with one thread, or `local[N]` to run locally with N threads. You
should start by using
-`local` for testing. For a full list of options, run Spark shell with the
`--help` option.
+`local` for testing. For a full list of options, run the Spark shell with the
`--help` option.
-Spark also provides an [R API](sparkr.html) since 1.4 (only DataFrame APIs are
included).
+Since version 1.4, Spark has provided an [R API](sparkr.html) (only the
DataFrame APIs are included).
To run Spark interactively in an R interpreter, use `bin/sparkR`:
./bin/sparkR --master "local[2]"
-Example applications are also provided in R. For example,
+Example applications are also provided in R. For example:
./bin/spark-submit examples/src/main/r/dataframe.R
diff --git a/docs/spark-connect-overview.md b/docs/spark-connect-overview.md
index e46fb9ad913..f942a884873 100644
--- a/docs/spark-connect-overview.md
+++ b/docs/spark-connect-overview.md
@@ -44,13 +44,13 @@ The Spark Connect client translates DataFrame operations
into unresolved
logical query plans which are encoded using protocol buffers. These are sent
to the server using the gRPC framework.
-The Spark Connect endpoint embedded on the Spark Server, receives and
+The Spark Connect endpoint embedded on the Spark Server receives and
translates unresolved logical plans into Spark's logical plan operators.
This is similar to parsing a SQL query, where attributes and relations are
parsed and an initial parse plan is built. From there, the standard Spark
execution process kicks in, ensuring that Spark Connect leverages all of
Spark's optimizations and enhancements. Results are streamed back to the
-client via gRPC as Apache Arrow-encoded row batches.
+client through gRPC as Apache Arrow-encoded row batches.
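A minimal client-side sketch of this flow, assuming a Spark Connect server is
already listening on localhost and using the remote() method on the PySpark
session builder:

from pyspark.sql import SparkSession

# Connect to the Spark Connect endpoint; no Spark JVM runs on the client.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# The DataFrame operations below are encoded as an unresolved logical plan,
# sent to the server over gRPC, and only the Arrow-encoded results come back.
spark.range(100).filter("id % 2 == 0").count()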
<p style="text-align: center;">
<img src="img/spark-connect-communication.png" title="Spark Connect
communication" alt="Spark Connect communication" />
@@ -67,11 +67,11 @@ own dependencies on the client and don't need to worry
about potential conflicts
with the Spark driver.
**Upgradability**: The Spark driver can now seamlessly be upgraded
independently
-of applications, e.g. to benefit from performance improvements and security
fixes.
+of applications, for example to benefit from performance improvements and
security fixes.
This means applications can be forward-compatible, as long as the server-side
RPC
definitions are designed to be backwards compatible.
-**Debuggability and Observability**: Spark Connect enables interactive
debugging
+**Debuggability and observability**: Spark Connect enables interactive
debugging
during development directly from your favorite IDE. Similarly, applications can
be monitored using the application's framework native metrics and logging
libraries.
@@ -106,8 +106,8 @@ Spark Connect, like in this example:
Note that we include a Spark Connect package (`spark-connect_2.12:3.4.0`),
when starting
Spark server. This is required to use Spark Connect. Make sure to use the same
version
-of the package as the Spark version you downloaded above. In the example here,
Spark 3.4.0
-with Scala 2.12.
+of the package as the Spark version you downloaded previously. In this example,
+Spark 3.4.0 with Scala 2.12.
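As a rough sketch (assuming the server is started with the bundled
start-connect-server.sh script and that you downloaded Spark 3.4.0 built for
Scala 2.12):

# The package version (3.4.0, Scala 2.12) must match the Spark release you run.
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0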
Now Spark server is running and ready to accept Spark Connect sessions from
client
applications. In the next section we will walk through how to use Spark Connect
@@ -116,7 +116,7 @@ when writing client applications.
## Use Spark Connect in client applications
When creating a Spark session, you can specify that you want to use Spark
Connect
-and there are a few ways to do that as outlined below.
+and there are a few ways to do that outlined as follows.
If you do not use one of the mechanisms outlined here, your Spark session will
work just like before, without leveraging Spark Connect, and your application
code
@@ -125,12 +125,12 @@ will run on the Spark driver node.
### Set SPARK_REMOTE environment variable
If you set the `SPARK_REMOTE` environment variable on the client machine where
your
-Spark client application is running and create a new Spark Session as
illustrated
-below, the session will be a Spark Connect session. With this approach, there
is
-no code change needed to start using Spark Connect.
+Spark client application is running and create a new Spark Session as in the
following
+example, the session will be a Spark Connect session. With this approach,
there is no
+code change needed to start using Spark Connect.
In a terminal window, set the `SPARK_REMOTE` environment variable to point to
the
-local Spark server you started on your computer above:
+local Spark server you started previously on your computer:
{% highlight bash %}
export SPARK_REMOTE="sc://localhost"
@@ -164,8 +164,8 @@ spark = SparkSession.builder.getOrCreate()
</div>
-Which will create a Spark Connect session from your application by reading the
-`SPARK_REMOTE` environment variable we set above.
+This will create a Spark Connect session from your application by reading the
+`SPARK_REMOTE` environment variable we set previously.
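A small sketch of how to confirm this (the query here is just an arbitrary
example):

from pyspark.sql import SparkSession

# With SPARK_REMOTE set, no further configuration is needed.
spark = SparkSession.builder.getOrCreate()

# This runs on the Spark server; results are streamed back to the client.
spark.range(10).show()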
### Specify Spark Connect when creating Spark session
@@ -180,7 +180,7 @@ illustrated here.
<div data-lang="python" markdown="1">
To launch the PySpark shell with Spark Connect, simply include the `remote`
parameter and specify the location of your Spark server. We are using
`localhost`
-in this example to connect to the local Spark server we started above.
+in this example to connect to the local Spark server we started previously.
{% highlight bash %}
./bin/pyspark --remote "sc://localhost"