jojochuang commented on code in PR #243:
URL: https://github.com/apache/ozone-site/pull/243#discussion_r2691567181
##########
docs/04-user-guide/03-integrations/06-spark.md:
##########
@@ -1,3 +1,166 @@
-# Spark
+---
+sidebar_label: Spark
+---
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
+# Using Apache Spark with Ozone
+
+Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs.
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and RDD APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `hadoop-ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath.

Review Comment:
   Renamed. It's ozone-filesystem-hadoop3.jar now.
   https://mvnrepository.com/artifact/org.apache.ozone/ozone-filesystem-hadoop3

##########
docs/04-user-guide/03-integrations/06-spark.md:
##########
@@ -1,3 +1,166 @@
-# Spark
+---
+sidebar_label: Spark
+---
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
+# Using Apache Spark with Ozone
+
+Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs.
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and RDD APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `hadoop-ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath.
+3. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster.
+
+## Configuration
+
+### 1. Core Site (`core-site.xml`)
+
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration).
+
+### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
+
+While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary:
+
+```properties
+spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
+spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem
+```
+
+### 3. Client JAR Placement
+
+Copy the `hadoop-ozone-filesystem-*.jar` to the `$SPARK_HOME/jars/` directory on all nodes where Spark driver and executors run. Alternatively, provide it using the `--jars` option in `spark-submit`.
+
+### 4. Security (Kerberos)
+
+If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone.
+Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI:
+
+```properties
+# For YARN deployments
+spark.yarn.access.hadoopFileSystems=ofs://ozone1/
+```
+
+Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job has a valid Kerberos ticket (`kinit`).
+
+## Usage Examples
+
+You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem.
+
+**URI Format:** `ofs://<om-service-id>/<volume>/<bucket>/path/to/key`
+
+### Reading Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+// Read a CSV file from Ozone
+val df = spark.read.format("csv")
+  .option("header", "true")
+  .option("inferSchema", "true")
+  .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+
+spark.stop()
+```
+
+### Writing Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()
+
+// Assume 'df' is a DataFrame you want to write
+val data = Seq(("Alice", 1), ("Bob", 2), ("Charlie", 3))
+val df = spark.createDataFrame(data).toDF("name", "id")
+
+// Write DataFrame to Ozone as Parquet files
+df.write.mode("overwrite")
+  .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet")
+
+spark.stop()
+```
+
+### Reading Data (Python)
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+# Read a CSV file from Ozone
+df = spark.read.format("csv") \
+    .option("header", "true") \
+    .option("inferSchema", "true") \
+    .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+
+spark.stop()
+```
+
+### Writing Data (Python)
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()
+
+# Assume 'df' is a DataFrame you want to write
+data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
+columns = ["name", "id"]
+df = spark.createDataFrame(data, columns)
+
+# Write DataFrame to Ozone as Parquet files
+df.write.mode("overwrite") \
+    .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet")
+
+spark.stop()
+```
+
+## Spark on Kubernetes

Review Comment:
   I haven't tried the steps below.

##########
docs/04-user-guide/03-integrations/06-spark.md:
##########
@@ -1,3 +1,166 @@
-# Spark
+---
+sidebar_label: Spark
+---
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
+# Using Apache Spark with Ozone
+
+Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs.
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and RDD APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `hadoop-ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath.

Review Comment:
   Renamed. It's ozone-filesystem-hadoop3.jar now.
   https://mvnrepository.com/artifact/org.apache.ozone/ozone-filesystem-hadoop3
   Please update other references to the hadoop-ozone-filesystem jar below.
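Following the rename noted in the review comment above, a minimal sketch of supplying the connector jar via Spark configuration instead of copying it into `$SPARK_HOME/jars/` — the path and version shown are illustrative assumptions, not values from this PR:

```properties
# Illustrative path and version for the renamed connector jar; adjust to your install
spark.jars=/opt/ozone/share/ozone/lib/ozone-filesystem-hadoop3-1.4.0.jar
```

Equivalently, the same jar can be passed per job with the `--jars` option of `spark-submit`.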
##########
docs/04-user-guide/03-integrations/06-spark.md:
##########
@@ -1,3 +1,166 @@
-# Spark
+---
+sidebar_label: Spark
+---
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
+# Using Apache Spark with Ozone
+
+Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs.
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and RDD APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `hadoop-ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath.
+3. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster.
+
+## Configuration
+
+### 1. Core Site (`core-site.xml`)
+
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration).
+
+### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
+
+While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary:
+
+```properties
+spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
+spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem
+```
+
+### 3. Client JAR Placement
+
+Copy the `hadoop-ozone-filesystem-*.jar` to the `$SPARK_HOME/jars/` directory on all nodes where Spark driver and executors run. Alternatively, provide it using the `--jars` option in `spark-submit`.
+
+### 4. Security (Kerberos)
+
+If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone. Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI:
+
+```properties
+# For YARN deployments
+spark.yarn.access.hadoopFileSystems=ofs://ozone1/

Review Comment:
   This only works for Spark 2. For Spark 3 and above, use `spark.kerberos.access.hadoopFileSystems`.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
