Copilot commented on code in PR #52656:
URL: https://github.com/apache/spark/pull/52656#discussion_r2443414648


##########
sbin/README.md:
##########
@@ -0,0 +1,514 @@
+# Spark Admin Scripts
+
+This directory contains administrative scripts for managing Spark standalone clusters.
+
+## Overview
+
+The `sbin/` scripts are used by cluster administrators to:
+- Start and stop Spark standalone clusters
+- Start and stop individual daemons (master, workers, history server)
+- Manage cluster lifecycle
+- Configure cluster nodes
+
+**Note**: These scripts are for **Spark Standalone** cluster mode only. For YARN, Kubernetes, or Mesos, use their respective cluster management tools.
+
+## Cluster Management Scripts
+
+### start-all.sh / stop-all.sh
+
+Start or stop all Spark daemons on the cluster.
+
+**Usage:**
+```bash
+# Start master and all workers
+./sbin/start-all.sh
+
+# Stop all daemons
+./sbin/stop-all.sh
+```
+
+**What they do:**
+- `start-all.sh`: Starts master on the current machine and workers on machines listed in `conf/workers`
+- `stop-all.sh`: Stops all master and worker daemons
+
+**Prerequisites:**
+- SSH key-based authentication configured
+- `conf/workers` file with worker hostnames
+- Spark installed at same location on all machines
+
+**Configuration files:**
+- `conf/workers`: List of worker hostnames (one per line)
+- `conf/spark-env.sh`: Environment variables
+
+### start-master.sh / stop-master.sh
+
+Start or stop the Spark master daemon on the current machine.
+
+**Usage:**
+```bash
+# Start master
+./sbin/start-master.sh
+
+# Stop master
+./sbin/stop-master.sh
+```
+
+**Master Web UI**: Access at `http://<master-host>:8080/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_MASTER_HOST=master-hostname
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+```
+
+### start-worker.sh / stop-worker.sh
+
+Start or stop a Spark worker daemon on the current machine.
+
+**Usage:**
+```bash
+# Start worker connecting to master
+./sbin/start-worker.sh spark://master:7077
+
+# Stop worker
+./sbin/stop-worker.sh
+```
+
+**Worker Web UI**: Access at `http://<worker-host>:8081/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_WORKER_CORES=8      # Number of cores to use
+export SPARK_WORKER_MEMORY=16g   # Memory to allocate
+export SPARK_WORKER_PORT=7078    # Worker port
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work  # Work directory
+```
+
+### start-workers.sh / stop-workers.sh
+
+Start or stop workers on all machines listed in `conf/workers`.
+
+**Usage:**
+```bash
+# Start all workers
+./sbin/start-workers.sh spark://master:7077
+
+# Stop all workers
+./sbin/stop-workers.sh
+```
+
+**Requirements:**
+- `conf/workers` file configured
+- SSH access to all worker machines
+- Master URL (for starting)
+
+## History Server Scripts
+
+### start-history-server.sh / stop-history-server.sh
+
+Start or stop the Spark History Server for viewing completed application logs.
+
+**Usage:**
+```bash
+# Start history server
+./sbin/start-history-server.sh
+
+# Stop history server
+./sbin/stop-history-server.sh
+```
+
+**History Server UI**: Access at `http://<host>:18080/`
+
+**Configuration:**
+```properties
+# In conf/spark-defaults.conf
+spark.history.fs.logDirectory=hdfs://namenode/spark-logs
+spark.history.ui.port=18080
+spark.eventLog.enabled=true
+spark.eventLog.dir=hdfs://namenode/spark-logs
+```
+
+**Requirements:**
+- Applications must have event logging enabled
+- Log directory must be accessible
+
+## Shuffle Service Scripts
+
+### start-shuffle-service.sh / stop-shuffle-service.sh
+
+Start or stop the external shuffle service (for YARN).
+
+**Usage:**
+```bash
+# Start shuffle service
+./sbin/start-shuffle-service.sh
+
+# Stop shuffle service
+./sbin/stop-shuffle-service.sh
+```
+
+**Note**: Typically used only when running on YARN without the YARN auxiliary service.
+
+## Configuration Files
+
+### conf/workers
+
+Lists worker hostnames, one per line.
+
+**Example:**
+```
+worker1.example.com
+worker2.example.com
+worker3.example.com
+```
+
+**Usage:**
+- Used by `start-all.sh` and `start-workers.sh`
+- Each line should contain a hostname or IP address
+- Blank lines and lines starting with `#` are ignored
+
+### conf/spark-env.sh
+
+Environment variables for Spark daemons.
+
+**Example:**
+```bash
+#!/usr/bin/env bash
+
+# Java
+export JAVA_HOME=/usr/lib/jvm/java-17
+
+# Master settings
+export SPARK_MASTER_HOST=master.example.com
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+
+# Worker settings
+export SPARK_WORKER_CORES=8
+export SPARK_WORKER_MEMORY=16g
+export SPARK_WORKER_PORT=7078
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work
+
+# Directories
+export SPARK_LOG_DIR=/var/log/spark
+export SPARK_PID_DIR=/var/run/spark
+
+# History Server
+export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/spark-logs"
+
+# Additional Java options
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
+```
+
+**Key Variables:**
+
+**Master:**
+- `SPARK_MASTER_HOST`: Master hostname
+- `SPARK_MASTER_PORT`: Master port (default: 7077)
+- `SPARK_MASTER_WEBUI_PORT`: Web UI port (default: 8080)
+
+**Worker:**
+- `SPARK_WORKER_CORES`: Number of cores per worker
+- `SPARK_WORKER_MEMORY`: Memory per worker (e.g., 16g)
+- `SPARK_WORKER_PORT`: Worker communication port
+- `SPARK_WORKER_WEBUI_PORT`: Worker web UI port (default: 8081)
+- `SPARK_WORKER_DIR`: Directory for scratch space and logs
+- `SPARK_WORKER_INSTANCES`: Number of worker instances per machine
+
+**General:**
+- `SPARK_LOG_DIR`: Directory for daemon logs
+- `SPARK_PID_DIR`: Directory for PID files
+- `SPARK_IDENT_STRING`: Identifier for daemons (default: username)
+- `SPARK_NICENESS`: Nice value for daemons
+- `SPARK_DAEMON_MEMORY`: Memory for daemon processes
+
+## Setting Up a Standalone Cluster
+
+### Step 1: Install Spark on All Nodes
+
+```bash
+# Download and extract Spark on each machine
+tar xzf spark-X.Y.Z-bin-hadoopX.tgz
+cd spark-X.Y.Z-bin-hadoopX
+```
+
+### Step 2: Configure spark-env.sh
+
+Create `conf/spark-env.sh` from template:
+```bash
+cp conf/spark-env.sh.template conf/spark-env.sh
+# Edit conf/spark-env.sh with appropriate settings
+```
+
+### Step 3: Configure Workers File
+
+Create `conf/workers`:
+```bash
+cp conf/workers.template conf/workers
+# Add worker hostnames, one per line
+```
+
+### Step 4: Configure SSH Access
+
+Set up password-less SSH from master to all workers:
+```bash
+ssh-keygen -t rsa
+ssh-copy-id user@worker1
+ssh-copy-id user@worker2
+# ... for each worker
+```
+
+### Step 5: Synchronize Configuration
+
+Copy configuration to all workers:
+```bash
+for host in $(cat conf/workers); do
+  rsync -av conf/ user@$host:spark/conf/
+done
+```
+
+### Step 6: Start the Cluster
+
+```bash
+./sbin/start-all.sh
+```
+
+### Step 7: Verify
+
+- Check master UI: `http://master:8080`
+- Check worker UIs: `http://worker1:8081`, etc.
+- Look for workers registered with master
+
+## High Availability
+
+For production deployments, configure high availability with ZooKeeper.
+
+### ZooKeeper-based HA Configuration
+
+**In conf/spark-env.sh:**
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+  -Dspark.deploy.recoveryMode=ZOOKEEPER
+  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
+  -Dspark.deploy.zookeeper.dir=/spark
+"
+```
+
+### Start Multiple Masters
+
+```bash
+# On master1
+./sbin/start-master.sh
+
+# On master2
+./sbin/start-master.sh
+
+# On master3
+./sbin/start-master.sh
+```
+
+### Connect Workers to All Masters
+
+```bash
+./sbin/start-worker.sh spark://master1:7077,master2:7077,master3:7077
+```
+
+**Automatic failover:** If active master fails, standby masters detect the failure and one becomes active.
+
+## Monitoring and Logs
+
+### Log Files
+
+Daemon logs are written to `$SPARK_LOG_DIR` (default: `logs/`):
+
+```bash
+# Master log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.master.Master-*.out
+
+# Worker log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.worker.Worker-*.out
+
+# History Server log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.history.HistoryServer-*.out
+```
+
+### View Logs
+
+```bash
+# Tail master log
+tail -f logs/spark-*-master-*.out
+
+# Tail worker log
+tail -f logs/spark-*-worker-*.out
+
+# Search for errors
+grep ERROR logs/spark-*-master-*.out
+```
+
+### Web UIs
+
+- **Master UI**: `http://<master>:8080` - Cluster status, workers, applications
+- **Worker UI**: `http://<worker>:8081` - Worker status, running executors
+- **Application UI**: `http://<driver>:4040` - Running application metrics
+- **History Server**: `http://<history-server>:18080` - Completed applications
+
+## Advanced Configuration
+
+### Memory Overhead
+
+Reserve memory for system processes:
+```bash
+export SPARK_DAEMON_MEMORY=2g
+```
+
+### Multiple Workers per Machine
+
+Run multiple worker instances on a single machine:
+```bash
+export SPARK_WORKER_INSTANCES=2
+export SPARK_WORKER_CORES=4      # Cores per instance
+export SPARK_WORKER_MEMORY=8g    # Memory per instance
+```
+
+### Work Directory
+
+Change worker scratch space:
+```bash
+export SPARK_WORKER_DIR=/mnt/fast-disk/spark-work
+```
+
+### Port Configuration
+
+Use non-default ports:
+```bash
+export SPARK_MASTER_PORT=9077
+export SPARK_MASTER_WEBUI_PORT=9080
+export SPARK_WORKER_PORT=9078
+export SPARK_WORKER_WEBUI_PORT=9081
+```
+
+## Security
+
+### Enable Authentication
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+  -Dspark.authenticate=true
+  -Dspark.authenticate.secret=your-secret-key
+"
+```
+
+### Enable SSL
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+  -Dspark.ssl.enabled=true
+  -Dspark.ssl.keyStore=/path/to/keystore
+  -Dspark.ssl.keyStorePassword=password
+  -Dspark.ssl.trustStore=/path/to/truststore
+  -Dspark.ssl.trustStorePassword=password
+"
+```
+
+## Troubleshooting
+
+### Master Won't Start
+
+**Check:**
+1. Port 7077 is available: `netstat -an | grep 7077`
+2. Hostname is resolvable: `ping $SPARK_MASTER_HOST`
+3. Logs for errors: `cat logs/spark-*-master-*.out`
+
+### Workers Not Connecting
+
+**Check:**
+1. Master URL is correct
+2. Network connectivity: `telnet master 7077`
+3. Firewall allows connections
+4. Worker logs: `cat logs/spark-*-worker-*.out`
+
+### SSH Connection Issues
+
+**Solutions:**
+1. Verify SSH key: `ssh worker1 echo test`
+2. Check SSH config: `~/.ssh/config`
+3. Use SSH agent: `eval $(ssh-agent); ssh-add`
+
+### Insufficient Resources
+
+**Check:**
+- Worker has enough memory: `free -h`
+- Enough cores available: `nproc`
+- Disk space: `df -h`
+
+## Cluster Shutdown
+
+### Graceful Shutdown
+
+```bash
+# Stop all workers first
+./sbin/stop-workers.sh
+
+# Stop master
+./sbin/stop-master.sh
+
+# Or stop everything
+./sbin/stop-all.sh
+```
+
+### Check All Stopped
+
+```bash
+# Check for running Java processes
+jps | grep -E "(Master|Worker)"
+```
+
+### Force Kill if Needed
+
+```bash
+# Kill any remaining Spark processes
+pkill -f org.apache.spark.deploy
+```
+
+## Best Practices
+
+1. **Use HA in production**: Configure ZooKeeper-based HA
+2. **Monitor resources**: Watch CPU, memory, disk usage
+3. **Separate log directories**: Use dedicated disk for logs
+4. **Regular maintenance**: Clean old logs and application data
+5. **Automate startup**: Use systemd or init scripts
+6. **Configure limits**: Set file descriptor and process limits
+7. **Use external shuffle service**: For better fault tolerance
+8. **Back up metadata**: Regularly back up ZooKeeper data
+
+## Scripts Reference
+
+| Script | Purpose |
+|--------|---------|
+| `start-all.sh` | Start master and all workers |
+| `stop-all.sh` | Stop master and all workers |
+| `start-master.sh` | Start master on current machine |
+| `stop-master.sh` | Stop master |
+| `start-worker.sh` | Start worker on current machine |
+| `stop-worker.sh` | Stop worker |
+| `start-workers.sh` | Start workers on all machines in `conf/workers` |
+| `stop-workers.sh` | Stop all workers |
+| `start-history-server.sh` | Start history server |
+| `stop-history-server.sh` | Stop history server |

Review Comment:
   Corrected table formatting.



##########
sbin/README.md:
##########
@@ -0,0 +1,514 @@
+# Spark Admin Scripts
+
+This directory contains administrative scripts for managing Spark standalone clusters.
+
+## Overview
+
+The `sbin/` scripts are used by cluster administrators to:
+- Start and stop Spark standalone clusters
+- Start and stop individual daemons (master, workers, history server)
+- Manage cluster lifecycle
+- Configure cluster nodes
+
+**Note**: These scripts are for **Spark Standalone** cluster mode only. For YARN, Kubernetes, or Mesos, use their respective cluster management tools.
+
+## Cluster Management Scripts
+
+### start-all.sh / stop-all.sh
+
+Start or stop all Spark daemons on the cluster.
+
+**Usage:**
+```bash
+# Start master and all workers
+./sbin/start-all.sh
+
+# Stop all daemons
+./sbin/stop-all.sh
+```
+
+**What they do:**
+- `start-all.sh`: Starts master on the current machine and workers on machines listed in `conf/workers`
+- `stop-all.sh`: Stops all master and worker daemons
+
+**Prerequisites:**
+- SSH key-based authentication configured
+- `conf/workers` file with worker hostnames
+- Spark installed at same location on all machines
+
+**Configuration files:**
+- `conf/workers`: List of worker hostnames (one per line)
+- `conf/spark-env.sh`: Environment variables
+
+### start-master.sh / stop-master.sh
+
+Start or stop the Spark master daemon on the current machine.
+
+**Usage:**
+```bash
+# Start master
+./sbin/start-master.sh
+
+# Stop master
+./sbin/stop-master.sh
+```
+
+**Master Web UI**: Access at `http://<master-host>:8080/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_MASTER_HOST=master-hostname
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+```
+
+### start-worker.sh / stop-worker.sh
+
+Start or stop a Spark worker daemon on the current machine.
+
+**Usage:**
+```bash
+# Start worker connecting to master
+./sbin/start-worker.sh spark://master:7077
+
+# Stop worker
+./sbin/stop-worker.sh
+```
+
+**Worker Web UI**: Access at `http://<worker-host>:8081/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_WORKER_CORES=8      # Number of cores to use
+export SPARK_WORKER_MEMORY=16g   # Memory to allocate
+export SPARK_WORKER_PORT=7078    # Worker port
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work  # Work directory
+```
+
+### start-workers.sh / stop-workers.sh
+
+Start or stop workers on all machines listed in `conf/workers`.
+
+**Usage:**
+```bash
+# Start all workers
+./sbin/start-workers.sh spark://master:7077
+
+# Stop all workers
+./sbin/stop-workers.sh
+```
+
+**Requirements:**
+- `conf/workers` file configured
+- SSH access to all worker machines
+- Master URL (for starting)
+
+## History Server Scripts
+
+### start-history-server.sh / stop-history-server.sh
+
+Start or stop the Spark History Server for viewing completed application logs.
+
+**Usage:**
+```bash
+# Start history server
+./sbin/start-history-server.sh
+
+# Stop history server
+./sbin/stop-history-server.sh
+```
+
+**History Server UI**: Access at `http://<host>:18080/`
+
+**Configuration:**
+```properties
+# In conf/spark-defaults.conf
+spark.history.fs.logDirectory=hdfs://namenode/spark-logs
+spark.history.ui.port=18080
+spark.eventLog.enabled=true
+spark.eventLog.dir=hdfs://namenode/spark-logs
+```
+
+**Requirements:**
+- Applications must have event logging enabled
+- Log directory must be accessible
+
+## Shuffle Service Scripts
+
+### start-shuffle-service.sh / stop-shuffle-service.sh
+
+Start or stop the external shuffle service (for YARN).
+
+**Usage:**
+```bash
+# Start shuffle service
+./sbin/start-shuffle-service.sh
+
+# Stop shuffle service
+./sbin/stop-shuffle-service.sh
+```
+
+**Note**: Typically used only when running on YARN without the YARN auxiliary service.
+
+## Configuration Files
+
+### conf/workers
+
+Lists worker hostnames, one per line.
+
+**Example:**
+```
+worker1.example.com
+worker2.example.com
+worker3.example.com
+```
+
+**Usage:**
+- Used by `start-all.sh` and `start-workers.sh`
+- Each line should contain a hostname or IP address
+- Blank lines and lines starting with `#` are ignored
+
+### conf/spark-env.sh
+
+Environment variables for Spark daemons.
+
+**Example:**
+```bash
+#!/usr/bin/env bash
+
+# Java
+export JAVA_HOME=/usr/lib/jvm/java-17
+
+# Master settings
+export SPARK_MASTER_HOST=master.example.com
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+
+# Worker settings
+export SPARK_WORKER_CORES=8
+export SPARK_WORKER_MEMORY=16g
+export SPARK_WORKER_PORT=7078
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work
+
+# Directories
+export SPARK_LOG_DIR=/var/log/spark
+export SPARK_PID_DIR=/var/run/spark
+
+# History Server
+export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/spark-logs"
+
+# Additional Java options
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
+```
+
+**Key Variables:**
+
+**Master:**
+- `SPARK_MASTER_HOST`: Master hostname
+- `SPARK_MASTER_PORT`: Master port (default: 7077)
+- `SPARK_MASTER_WEBUI_PORT`: Web UI port (default: 8080)
+
+**Worker:**
+- `SPARK_WORKER_CORES`: Number of cores per worker
+- `SPARK_WORKER_MEMORY`: Memory per worker (e.g., 16g)
+- `SPARK_WORKER_PORT`: Worker communication port
+- `SPARK_WORKER_WEBUI_PORT`: Worker web UI port (default: 8081)
+- `SPARK_WORKER_DIR`: Directory for scratch space and logs
+- `SPARK_WORKER_INSTANCES`: Number of worker instances per machine
+
+**General:**
+- `SPARK_LOG_DIR`: Directory for daemon logs
+- `SPARK_PID_DIR`: Directory for PID files
+- `SPARK_IDENT_STRING`: Identifier for daemons (default: username)
+- `SPARK_NICENESS`: Nice value for daemons
+- `SPARK_DAEMON_MEMORY`: Memory for daemon processes
+
+## Setting Up a Standalone Cluster
+
+### Step 1: Install Spark on All Nodes
+
+```bash
+# Download and extract Spark on each machine
+tar xzf spark-X.Y.Z-bin-hadoopX.tgz
+cd spark-X.Y.Z-bin-hadoopX
+```
+
+### Step 2: Configure spark-env.sh
+
+Create `conf/spark-env.sh` from template:
+```bash
+cp conf/spark-env.sh.template conf/spark-env.sh
+# Edit conf/spark-env.sh with appropriate settings
+```
+
+### Step 3: Configure Workers File
+
+Create `conf/workers`:
+```bash
+cp conf/workers.template conf/workers
+# Add worker hostnames, one per line
+```
+
+### Step 4: Configure SSH Access
+
+Set up password-less SSH from master to all workers:
+```bash
+ssh-keygen -t rsa
+ssh-copy-id user@worker1
+ssh-copy-id user@worker2
+# ... for each worker
+```
+
+### Step 5: Synchronize Configuration
+
+Copy configuration to all workers:
+```bash
+for host in $(cat conf/workers); do
+  rsync -av conf/ user@$host:spark/conf/
+done
+```
+
+### Step 6: Start the Cluster
+
+```bash
+./sbin/start-all.sh
+```
+
+### Step 7: Verify
+
+- Check master UI: `http://master:8080`
+- Check worker UIs: `http://worker1:8081`, etc.
+- Look for workers registered with master
+
+## High Availability
+
+For production deployments, configure high availability with ZooKeeper.
+
+### ZooKeeper-based HA Configuration
+
+**In conf/spark-env.sh:**
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+  -Dspark.deploy.recoveryMode=ZOOKEEPER
+  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
+  -Dspark.deploy.zookeeper.dir=/spark
+"
+```
+
+### Start Multiple Masters
+
+```bash
+# On master1
+./sbin/start-master.sh
+
+# On master2
+./sbin/start-master.sh
+
+# On master3
+./sbin/start-master.sh
+```
+
+### Connect Workers to All Masters
+
+```bash
+./sbin/start-worker.sh spark://master1:7077,master2:7077,master3:7077
+```
+
+**Automatic failover:** If active master fails, standby masters detect the failure and one becomes active.
+
+## Monitoring and Logs
+
+### Log Files
+
+Daemon logs are written to `$SPARK_LOG_DIR` (default: `logs/`):
+
+```bash
+# Master log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.master.Master-*.out
+
+# Worker log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.worker.Worker-*.out
+
+# History Server log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.history.HistoryServer-*.out
+```
+
+### View Logs
+
+```bash
+# Tail master log
+tail -f logs/spark-*-master-*.out
+
+# Tail worker log
+tail -f logs/spark-*-worker-*.out
+
+# Search for errors
+grep ERROR logs/spark-*-master-*.out
+```
+
+### Web UIs
+
+- **Master UI**: `http://<master>:8080` - Cluster status, workers, applications
+- **Worker UI**: `http://<worker>:8081` - Worker status, running executors
+- **Application UI**: `http://<driver>:4040` - Running application metrics
+- **History Server**: `http://<history-server>:18080` - Completed applications
+
+## Advanced Configuration
+
+### Memory Overhead
+
+Reserve memory for system processes:
+```bash
+export SPARK_DAEMON_MEMORY=2g
+```
+
+### Multiple Workers per Machine
+
+Run multiple worker instances on a single machine:
+```bash
+export SPARK_WORKER_INSTANCES=2
+export SPARK_WORKER_CORES=4      # Cores per instance
+export SPARK_WORKER_MEMORY=8g    # Memory per instance
+```
+
+### Work Directory
+
+Change worker scratch space:
+```bash
+export SPARK_WORKER_DIR=/mnt/fast-disk/spark-work
+```
+
+### Port Configuration
+
+Use non-default ports:
+```bash
+export SPARK_MASTER_PORT=9077
+export SPARK_MASTER_WEBUI_PORT=9080
+export SPARK_WORKER_PORT=9078
+export SPARK_WORKER_WEBUI_PORT=9081
+```
+
+## Security
+
+### Enable Authentication
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+  -Dspark.authenticate=true
+  -Dspark.authenticate.secret=your-secret-key
+"
+```
+
+### Enable SSL
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="
+  -Dspark.ssl.enabled=true
+  -Dspark.ssl.keyStore=/path/to/keystore
+  -Dspark.ssl.keyStorePassword=password
+  -Dspark.ssl.trustStore=/path/to/truststore
+  -Dspark.ssl.trustStorePassword=password
+"
+```
+
+## Troubleshooting
+
+### Master Won't Start
+
+**Check:**
+1. Port 7077 is available: `netstat -an | grep 7077`
+2. Hostname is resolvable: `ping $SPARK_MASTER_HOST`
+3. Logs for errors: `cat logs/spark-*-master-*.out`
+
+### Workers Not Connecting
+
+**Check:**
+1. Master URL is correct
+2. Network connectivity: `telnet master 7077`
+3. Firewall allows connections
+4. Worker logs: `cat logs/spark-*-worker-*.out`
+
+### SSH Connection Issues
+
+**Solutions:**
+1. Verify SSH key: `ssh worker1 echo test`
+2. Check SSH config: `~/.ssh/config`
+3. Use SSH agent: `eval $(ssh-agent); ssh-add`
+
+### Insufficient Resources
+
+**Check:**
+- Worker has enough memory: `free -h`
+- Enough cores available: `nproc`
+- Disk space: `df -h`
+
+## Cluster Shutdown
+
+### Graceful Shutdown
+
+```bash
+# Stop all workers first
+./sbin/stop-workers.sh
+
+# Stop master
+./sbin/stop-master.sh
+
+# Or stop everything
+./sbin/stop-all.sh
+```
+
+### Check All Stopped
+
+```bash
+# Check for running Java processes
+jps | grep -E "(Master|Worker)"
+```
+
+### Force Kill if Needed
+
+```bash
+# Kill any remaining Spark processes
+pkill -f org.apache.spark.deploy
+```
+
+## Best Practices
+
+1. **Use HA in production**: Configure ZooKeeper-based HA
+2. **Monitor resources**: Watch CPU, memory, disk usage
+3. **Separate log directories**: Use dedicated disk for logs
+4. **Regular maintenance**: Clean old logs and application data
+5. **Automate startup**: Use systemd or init scripts
+6. **Configure limits**: Set file descriptor and process limits
+7. **Use external shuffle service**: For better fault tolerance
+8. **Back up metadata**: Regularly back up ZooKeeper data
+
+## Scripts Reference
+
+| Script | Purpose |
+|--------|---------|

Review Comment:
   Suggested fix for the table header and separator:



##########
DEVELOPMENT.md:
##########
@@ -0,0 +1,462 @@
+# Spark Development Guide
+
+This guide provides information for developers working on Apache Spark.
+
+## Table of Contents
+
+- [Getting Started](#getting-started)
+- [Development Environment](#development-environment)
+- [Building Spark](#building-spark)
+- [Testing](#testing)
+- [Code Style](#code-style)
+- [IDE Setup](#ide-setup)
+- [Debugging](#debugging)
+- [Working with Git](#working-with-git)
+- [Common Development Tasks](#common-development-tasks)
+
+## Getting Started
+
+### Prerequisites
+
+- Java 17 or Java 21 (for Spark 4.x)
+- Maven 3.9.9 or later
+- Python 3.9+ (for PySpark development)
+- R 4.0+ (for SparkR development)
+- Git
+
+### Initial Setup
+
+1. **Clone the repository:**
+   ```bash
+   git clone https://github.com/apache/spark.git
+   cd spark
+   ```
+
+2. **Build Spark:**
+   ```bash
+   ./build/mvn -DskipTests clean package
+   ```
+
+3. **Verify the build:**
+   ```bash
+   ./bin/spark-shell
+   ```
+
+## Development Environment
+
+### Directory Structure
+
+```
+spark/
+├── assembly/          # Final assembly JAR creation
+├── bin/              # User command scripts (spark-submit, spark-shell, etc.)
+├── build/            # Build scripts and Maven wrapper
+├── common/           # Common utilities and modules
+├── conf/             # Configuration templates
+├── core/             # Spark Core
+├── dev/              # Development tools (run-tests, lint, etc.)
+├── docs/             # Documentation (Jekyll-based)
+├── examples/         # Example programs
+├── python/           # PySpark implementation
+├── R/                # SparkR implementation
+├── sbin/             # Admin scripts (start-all.sh, stop-all.sh, etc.)
+├── sql/              # Spark SQL
+└── [other modules]
+```
+
+### Key Development Directories
+
+- `dev/`: Contains scripts for testing, linting, and releasing
+- `dev/run-tests`: Main test runner
+- `dev/lint-*`: Various linting tools
+- `build/mvn`: Maven wrapper script
+
+## Building Spark
+
+### Full Build
+
+```bash
+# Build all modules, skip tests
+./build/mvn -DskipTests clean package
+
+# Build with specific Hadoop version
+./build/mvn -Phadoop-3.4 -DskipTests clean package
+
+# Build with Hive support
+./build/mvn -Phive -Phive-thriftserver -DskipTests package
+```
+
+### Module-Specific Builds
+
+```bash
+# Build only core module
+./build/mvn -pl core -DskipTests package
+
+# Build core and its dependencies
+./build/mvn -pl core -am -DskipTests package
+
+# Build SQL module
+./build/mvn -pl sql/core -am -DskipTests package
+```
+
+### Build Profiles
+
+Common Maven profiles:
+
+- `-Phadoop-3.4`: Build with Hadoop 3.4
+- `-Pyarn`: Include YARN support
+- `-Pkubernetes`: Include Kubernetes support
+- `-Phive`: Include Hive support
+- `-Phive-thriftserver`: Include Hive Thrift Server
+- `-Pscala-2.13`: Build with Scala 2.13
+
+### Fast Development Builds
+
+For faster iteration during development:
+
+```bash
+# Skip Scala and Java style checks
+./build/mvn -DskipTests -Dcheckstyle.skip package
+
+# Build specific module quickly
+./build/mvn -pl sql/core -am -DskipTests -Dcheckstyle.skip package
+```
+
+## Testing
+
+### Running All Tests
+
+```bash
+# Run all tests (takes several hours)
+./dev/run-tests
+
+# Run tests for specific modules
+./dev/run-tests --modules sql
+```
+
+### Running Specific Test Suites
+
+#### Scala/Java Tests
+
+```bash
+# Run all tests in a module
+./build/mvn test -pl core
+
+# Run a specific test suite
+./build/mvn test -pl core -Dtest=SparkContextSuite
+
+# Run specific test methods
+./build/mvn test -pl core -Dtest=SparkContextSuite#testJobInterruption
+```
+
+#### Python Tests
+
+```bash
+# Run all PySpark tests
+cd python && python run-tests.py
+
+# Run specific test file
+cd python && python -m pytest pyspark/tests/test_context.py
+
+# Run specific test method
+cd python && python -m pytest pyspark/tests/test_context.py::SparkContextTests::test_stop
+```
+
+#### R Tests
+
+```bash
+cd R
+R CMD check --no-manual --no-build-vignettes spark
+```
+
+### Test Coverage
+
+```bash
+# Generate coverage report
+./build/mvn clean install -DskipTests
+./dev/run-tests --coverage
+```
+
+## Code Style
+
+### Scala Code Style
+
+Spark uses Scalastyle for Scala code checking:
+
+```bash
+# Check Scala style
+./dev/lint-scala
+
+# Auto-format (if scalafmt is configured)
+./build/mvn scala:format
+```
+
+Key style guidelines:
+- 2-space indentation
+- Max line length: 100 characters
+- Follow [Scala style guide](https://docs.scala-lang.org/style/)
+
+### Java Code Style
+
+Java code follows Google Java Style:
+
+```bash
+# Check Java style
+./dev/lint-java
+```
+
+Key guidelines:
+- 2-space indentation
+- Max line length: 100 characters
+- Use Java 17+ features appropriately
+
+### Python Code Style
+
+PySpark follows PEP 8:
+
+```bash
+# Check Python style
+./dev/lint-python
+
+# Auto-format with black (if available)
+cd python && black pyspark/
+```
+
+Key guidelines:
+- 4-space indentation
+- Max line length: 100 characters
+- Type hints encouraged for new code
+
+## IDE Setup
+
+### IntelliJ IDEA
+
+1. **Import Project:**
+   - File → Open → Select `pom.xml`
+   - Choose "Open as Project"
+   - Import Maven projects automatically
+
+2. **Configure JDK:**
+   - File → Project Structure → Project SDK → Select Java 17 or 21
+
+3. **Recommended Plugins:**
+   - Scala plugin
+   - Python plugin
+   - Maven plugin
+
+4. **Code Style:**
+   - Import Spark code style from `dev/scalastyle-config.xml`
+
+### Visual Studio Code
+
+1. **Recommended Extensions:**
+   - Scala (Metals)
+   - Python
+   - Maven for Java
+
+2. **Workspace Settings:**
+   ```json
+   {
+     "java.configuration.maven.userSettings": ".mvn/settings.xml",
+     "python.linting.enabled": true,
+     "python.linting.pylintEnabled": true
+   }
+   ```
+
+### Eclipse
+
+1. **Import Project:**
+   - File → Import → Maven → Existing Maven Projects
+
+2. **Install Plugins:**
+   - Scala IDE
+   - Maven Integration
+
+## Debugging
+
+### Debugging Scala/Java Code
+
+#### Using IDE Debugger
+
+1. Run tests with debugging enabled in your IDE
+2. Set breakpoints in source code
+3. Run test in debug mode
+
+#### Command Line Debugging
+
+```bash
+# Enable remote debugging
+export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

Review Comment:
   For spark-shell and spark-submit, SPARK_SUBMIT_OPTS is the standard way to pass JVM options; SPARK_JAVA_OPTS may not be respected. Recommend replacing with SPARK_SUBMIT_OPTS, e.g.: export SPARK_SUBMIT_OPTS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005'.
   ```suggestion
   export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
   ```
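   For reference, a minimal sketch of how the suggested variable could be exercised (the attach step and IDE settings are illustrative):

   ```bash
   # With suspend=y the driver JVM pauses at startup until a debugger attaches on port 5005
   export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
   ./bin/spark-shell
   # Then attach a remote JVM debugger from the IDE (host: localhost, port: 5005)
   ```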



##########
graphx/README.md:
##########
@@ -0,0 +1,549 @@
+# GraphX
+
+GraphX is Apache Spark's API for graphs and graph-parallel computation.
+
+## Overview
+
+GraphX unifies ETL (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system. It provides:
+
+- **Graph Abstraction**: Efficient representation of property graphs
+- **Graph Algorithms**: PageRank, Connected Components, Triangle Counting, and more
+- **Pregel API**: For iterative graph computations
+- **Graph Builders**: Tools to construct graphs from RDDs or files
+- **Graph Operators**: Transformations and structural operations
+
+## Key Concepts
+
+### Property Graph
+
+A directed multigraph with properties attached to each vertex and edge.
+
+**Components:**
+- **Vertices**: Nodes with unique IDs and properties
+- **Edges**: Directed connections between vertices with properties
+- **Triplets**: A view joining vertices and edges
+
+```scala
+import org.apache.spark.graphx._
+
+// Create vertices RDD
+val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
+  (1L, "Alice"),
+  (2L, "Bob"),
+  (3L, "Charlie")
+))
+
+// Create edges RDD
+val edges: RDD[Edge[String]] = sc.parallelize(Array(
+  Edge(1L, 2L, "friend"),
+  Edge(2L, 3L, "follow")
+))
+
+// Build the graph
+val graph: Graph[String, String] = Graph(vertices, edges)
+```
+
+### Graph Structure
+
+```
+Graph[VD, ED]
+  - vertices: VertexRDD[VD]  // Vertices with properties of type VD
+  - edges: EdgeRDD[ED]        // Edges with properties of type ED
+  - triplets: RDD[EdgeTriplet[VD, ED]]  // Combined view
+```
+
+## Core Components
+
+### Graph Class
+
+The main graph abstraction.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/Graph.scala`
+
+**Key methods:**
+- `vertices: VertexRDD[VD]`: Access vertices
+- `edges: EdgeRDD[ED]`: Access edges
+- `triplets: RDD[EdgeTriplet[VD, ED]]`: Access triplets
+- `mapVertices[VD2](map: (VertexId, VD) => VD2)`: Transform vertex properties
+- `mapEdges[ED2](map: Edge[ED] => ED2)`: Transform edge properties
+- `subgraph(epred, vpred)`: Create subgraph based on predicates
+
+### VertexRDD
+
+Optimized RDD for vertex data.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/VertexRDD.scala`
+
+**Features:**
+- Fast lookups by vertex ID
+- Efficient joins with edge data
+- Reuse of vertex indices
+
+### EdgeRDD
+
+Optimized RDD for edge data.
+
+**Location**: `src/main/scala/org/apache/spark/graphx/EdgeRDD.scala`
+
+**Features:**
+- Compact edge storage
+- Fast filtering and mapping
+- Efficient partitioning
+
+### EdgeTriplet
+
+Represents a edge with its source and destination vertex properties.

Review Comment:
   Fix the grammatical error: 'a edge' should be 'an edge'.
   ```suggestion
   Represents an edge with its source and destination vertex properties.
   ```
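   For context, a minimal sketch of how an `EdgeTriplet` exposes these properties, reusing the `graph` built earlier in this README (the rendered strings are illustrative):

   ```scala
   // Each triplet carries the source attribute, edge attribute, and destination attribute
   graph.triplets
     .map(t => s"${t.srcAttr} is a ${t.attr} of ${t.dstAttr}")
     .collect()
     .foreach(println)  // e.g. "Alice is a friend of Bob"
   ```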



##########
streaming/README.md:
##########
@@ -0,0 +1,430 @@
+# Spark Streaming
+
+Spark Streaming provides scalable, high-throughput, fault-tolerant stream processing of live data streams.
+
+## Overview
+
+Spark Streaming supports two APIs:
+
+1. **DStreams (Discretized Streams)** - Legacy API (Deprecated as of Spark 3.4)
+2. **Structured Streaming** - Modern API built on Spark SQL (Recommended)
+
+**Note**: DStreams are deprecated. For new applications, use **Structured Streaming** which is located in the `sql/core` module.
+
+## DStreams (Legacy API)
+
+### What are DStreams?
+
+DStreams represent a continuous stream of data, internally represented as a sequence of RDDs.
+
+**Key characteristics:**
+- Micro-batch processing model
+- Integration with Kafka, Flume, Kinesis, TCP sockets, and more
+- Windowing operations for time-based aggregations
+- Stateful transformations with updateStateByKey
+- Fault tolerance through checkpointing
+
+### Location
+
+- Scala/Java: `src/main/scala/org/apache/spark/streaming/`
+- Python: `../python/pyspark/streaming/`
+
+### Basic Example
+
+```scala
+import org.apache.spark.streaming._
+import org.apache.spark.SparkConf
+
+val conf = new SparkConf().setAppName("NetworkWordCount")
+val ssc = new StreamingContext(conf, Seconds(1))
+
+// Create DStream from TCP source
+val lines = ssc.socketTextStream("localhost", 9999)
+
+// Process the stream
+val words = lines.flatMap(_.split(" "))
+val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
+
+// Print results
+wordCounts.print()
+
+// Start the computation
+ssc.start()
+ssc.awaitTermination()
+```
+
+### Key Components
+
+#### StreamingContext
+
+The main entry point for streaming functionality.
+
+**File**: `src/main/scala/org/apache/spark/streaming/StreamingContext.scala`
+
+**Usage:**
+```scala
+val ssc = new StreamingContext(sparkContext, Seconds(batchInterval))
+// or
+val ssc = new StreamingContext(conf, Seconds(batchInterval))
+```
+
+#### DStream
+
+The fundamental abstraction for a continuous data stream.
+
+**File**: `src/main/scala/org/apache/spark/streaming/dstream/DStream.scala`
+
+**Operations:**
+- **Transformations**: map, flatMap, filter, reduce, join, window
+- **Output Operations**: print, saveAsTextFiles, foreachRDD
+
+#### Input Sources
+
+**Built-in sources:**
+- `socketTextStream`: TCP socket source
+- `textFileStream`: File system monitoring
+- `queueStream`: Queue-based testing source
+
+**Advanced sources** (require external libraries):
+- Kafka: `KafkaUtils.createStream`
+- Flume: `FlumeUtils.createStream`
+- Kinesis: `KinesisUtils.createStream`
+
+**Location**: `src/main/scala/org/apache/spark/streaming/dstream/`
+
+### Windowing Operations
+
+Process data over sliding windows:
+
+```scala
+val windowedStream = lines
+  .window(Seconds(30), Seconds(10))  // 30s window, 10s slide
+  
+val windowedWordCounts = words
+  .map(x => (x, 1))
+  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
+```
+
+### Stateful Operations
+
+Maintain state across batches:
+
+```scala
+def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
+  val newCount = runningCount.getOrElse(0) + newValues.sum
+  Some(newCount)
+}
+
+val runningCounts = pairs.updateStateByKey(updateFunction)
+```
+
+### Checkpointing
+
+Essential for stateful operations and fault tolerance:
+
+```scala
+ssc.checkpoint("hdfs://checkpoint/directory")
+```
+
+**What gets checkpointed:**
+- Configuration
+- DStream operations
+- Incomplete batches
+- State data (for stateful operations)
+
+### Performance Tuning
+
+**Batch Interval**
+- Set based on processing time and latency requirements
+- Too small: overhead increases
+- Too large: latency increases
+
+**Parallelism**
+```scala
+// Increase receiver parallelism
+val numStreams = 5
+val streams = (1 to numStreams).map(_ => ssc.socketTextStream(...))
+val unifiedStream = ssc.union(streams)
+
+// Repartition for processing
+val repartitioned = dstream.repartition(10)
+```
+
+**Memory Management**
+```scala
+conf.set("spark.streaming.receiver.maxRate", "10000")
+conf.set("spark.streaming.kafka.maxRatePerPartition", "1000")
+```
+
+## Structured Streaming (Recommended)
+
+For new applications, use Structured Streaming instead of DStreams.
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/sql/streaming/`
+
+**Example:**
+```scala
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.streaming._
+
+val spark = SparkSession.builder()
+  .appName("StructuredNetworkWordCount")
+  .getOrCreate()
+
+import spark.implicits._
+
+// Create DataFrame from stream source
+val lines = spark
+  .readStream
+  .format("socket")
+  .option("host", "localhost")
+  .option("port", 9999)
+  .load()
+
+// Process the stream
+val words = lines.as[String].flatMap(_.split(" "))
+val wordCounts = words.groupBy("value").count()
+
+// Output the stream
+val query = wordCounts
+  .writeStream
+  .outputMode("complete")
+  .format("console")
+  .start()
+
+query.awaitTermination()
+```
+
+**Advantages over DStreams:**
+- Unified API with batch processing
+- Better performance with Catalyst optimizer
+- Exactly-once semantics
+- Event time processing
+- Watermarking for late data
+- Easier to reason about
+
+See [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md) for details.
+
+## Building and Testing
+
+### Build Streaming Module
+
+```bash
+# Build streaming module
+./build/mvn -pl streaming -am package
+
+# Skip tests
+./build/mvn -pl streaming -am -DskipTests package
+```
+
+### Run Tests
+
+```bash
+# Run all streaming tests
+./build/mvn test -pl streaming
+
+# Run specific test suite
+./build/mvn test -pl streaming -Dtest=BasicOperationsSuite
+```
+
+## Source Code Organization
+
+```
+streaming/src/main/
+├── scala/org/apache/spark/streaming/
+│   ├── StreamingContext.scala           # Main entry point
+│   ├── Time.scala                       # Time utilities
+│   ├── Checkpoint.scala                 # Checkpointing
+│   ├── dstream/
+│   │   ├── DStream.scala               # Base DStream
+│   │   ├── InputDStream.scala          # Input sources
+│   │   ├── ReceiverInputDStream.scala  # Receiver-based input
+│   │   ├── WindowedDStream.scala       # Windowing operations
+│   │   ├── StateDStream.scala          # Stateful operations
+│   │   └── PairDStreamFunctions.scala  # Key-value operations
+│   ├── receiver/
+│   │   ├── Receiver.scala              # Base receiver class
+│   │   ├── ReceiverSupervisor.scala    # Receiver management
+│   │   └── BlockGenerator.scala        # Block generation
+│   ├── scheduler/
+│   │   ├── JobScheduler.scala          # Job scheduling
+│   │   ├── JobGenerator.scala          # Job generation
+│   │   └── ReceiverTracker.scala       # Receiver tracking
+│   └── ui/
+│       └── StreamingTab.scala          # Web UI
+└── resources/
+```
+
+## Integration with External Systems
+
+### Apache Kafka
+
+**Deprecated DStreams approach:**
+```scala
+import org.apache.spark.streaming.kafka010._
+
+val kafkaParams = Map[String, Object](
+  "bootstrap.servers" -> "localhost:9092",
+  "key.deserializer" -> classOf[StringDeserializer],
+  "value.deserializer" -> classOf[StringDeserializer],
+  "group.id" -> "test-group"
+)
+
+val stream = KafkaUtils.createDirectStream[String, String](
+  ssc,
+  PreferConsistent,
+  Subscribe[String, String](topics, kafkaParams)
+)
+```
+
+**Recommended Structured Streaming approach:**
+```scala
+val df = spark
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "localhost:9092")
+  .option("subscribe", "topic1")
+  .load()
+```
+
+See [Kafka Integration Guide](../docs/streaming-kafka-integration.md).
+
+### Amazon Kinesis
+
+```scala
+import org.apache.spark.streaming.kinesis._
+
+val stream = KinesisInputDStream.builder
+  .streamingContext(ssc)
+  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
+  .regionName("us-east-1")
+  .streamName("myStream")
+  .build()
+```
+
+See [Kinesis Integration Guide](../docs/streaming-kinesis-integration.md).
+
+## Monitoring and Debugging
+
+### Streaming UI
+
+Access at: `http://<driver-node>:4040/streaming/`
+
+**Metrics:**
+- Batch processing times
+- Input rates
+- Scheduling delays
+- Active batches
+
+### Logs
+
+Enable detailed logging:
+```properties
+log4j.logger.org.apache.spark.streaming=DEBUG
+```
+
+### Metrics
+
+Key metrics to monitor:
+- **Batch Processing Time**: Should be < batch interval
+- **Scheduling Delay**: Should be minimal
+- **Total Delay**: End-to-end delay
+- **Input Rate**: Records per second
+
+## Common Issues
+
+### Batch Processing Time > Batch Interval
+
+**Symptoms**: Scheduling delay increases over time
+
+**Solutions:**
+- Increase parallelism
+- Optimize transformations
+- Increase resources (executors, memory)
+- Reduce batch interval data volume
+
+### Out of Memory Errors
+
+**Solutions:**
+- Increase executor memory
+- Enable compression
+- Reduce window/batch size
+- Persist less data
+
+### Receiver Failures
+
+**Solutions:**
+- Enable WAL (Write-Ahead Logs)
+- Increase receiver memory
+- Add multiple receivers
+- Use Structured Streaming with better fault tolerance
+
+## Migration from DStreams to Structured Streaming
+
+**Why migrate:**
+- DStreams are deprecated
+- Better performance and semantics
+- Unified API with batch processing
+- Active development and support
+
+**Key differences:**
+- DataFrame/Dataset API instead of RDDs
+- Declarative operations
+- Built-in support for event time
+- Exactly-once semantics by default
+
+**Migration guide**: See [Structured Streaming Migration Guide](../docs/ss-migration-guide.md)
+
+## Examples
+
+See [examples/src/main/scala/org/apache/spark/examples/streaming/](../examples/src/main/scala/org/apache/spark/examples/streaming/) for more examples.
+
+**Key examples:**
+- `NetworkWordCount.scala`: Basic word count
+- `StatefulNetworkWordCount.scala`: Stateful processing
+- `WindowedNetworkWordCount.scala`: Window operations
+- `KafkaWordCount.scala`: Kafka integration
+
+## Configuration
+
+Key configuration parameters:
+

Review Comment:
   spark.streaming.checkpoint.interval is not a valid Spark configuration. Checkpoint interval for DStreams is set per DStream in code via dstream.checkpoint(Seconds(n)). Please remove this property from the configuration block and, if desired, add a code example showing dstream.checkpoint(Seconds(10)).
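   A minimal sketch of the per-DStream form described above, reusing `ssc` and `wordCounts` from the word-count example earlier in this README:

   ```scala
   // Metadata and state checkpoints go to a reliable store
   ssc.checkpoint("hdfs://checkpoint/directory")
   // Checkpoint this DStream every 10 seconds (typically a multiple of the batch interval)
   wordCounts.checkpoint(Seconds(10))
   ```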



##########
mllib/README.md:
##########
@@ -0,0 +1,514 @@
+# MLlib - Machine Learning Library
+
+MLlib is Apache Spark's scalable machine learning library.
+
+## Overview
+
+MLlib provides:
+
+- **ML Algorithms**: Classification, regression, clustering, collaborative filtering
+- **Featurization**: Feature extraction, transformation, dimensionality reduction, selection
+- **Pipelines**: Tools for constructing, evaluating, and tuning ML workflows
+- **Utilities**: Linear algebra, statistics, data handling
+
+## Important Note
+
+MLlib includes two packages:
+
+1. **`spark.ml`** (DataFrame-based API) - **Primary API** (Recommended)
+2. **`spark.mllib`** (RDD-based API) - **Maintenance mode only**
+
+The RDD-based API (`spark.mllib`) is in maintenance mode. The DataFrame-based API (`spark.ml`) is the primary API and is recommended for all new applications.
+
+## Package Structure
+
+### spark.ml (Primary API)
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/`
+
+DataFrame-based API with:
+- **ML Pipeline API**: For building ML workflows
+- **Transformers**: Feature transformers
+- **Estimators**: Learning algorithms
+- **Models**: Fitted models
+
+```scala
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.feature.VectorAssembler
+
+// Create pipeline
+val assembler = new VectorAssembler()
+  .setInputCols(Array("feature1", "feature2"))
+  .setOutputCol("features")
+
+val lr = new LogisticRegression()
+  .setMaxIter(10)
+
+val pipeline = new Pipeline().setStages(Array(assembler, lr))
+
+// Fit model
+val model = pipeline.fit(trainingData)
+
+// Make predictions
+val predictions = model.transform(testData)
+```
+
+### spark.mllib (RDD-based API - Maintenance Mode)
+
+**Location**: `src/main/scala/org/apache/spark/mllib/`
+
+RDD-based API with:
+- Classic algorithms using RDDs
+- Maintained for backward compatibility
+- No new features added
+
+```scala
+import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Train model (old API)
+val data: RDD[LabeledPoint] = ...
+val model = LogisticRegressionWithLBFGS.train(data)
+
+// Make predictions
+val predictions = data.map { point => model.predict(point.features) }
+```
+
+## Key Concepts
+
+### Pipeline API (spark.ml)
+
+Machine learning pipelines provide:
+
+1. **DataFrame**: Unified data representation
+2. **Transformer**: Algorithms that transform DataFrames
+3. **Estimator**: Algorithms that fit on DataFrames to produce Transformers
+4. **Pipeline**: Chains multiple Transformers and Estimators
+5. **Parameter**: Common API for specifying parameters
+
+**Example Pipeline:**
+```scala
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
+
+// Configure pipeline stages
+val tokenizer = new Tokenizer()
+  .setInputCol("text")
+  .setOutputCol("words")
+
+val hashingTF = new HashingTF()
+  .setInputCol("words")
+  .setOutputCol("features")
+
+val lr = new LogisticRegression()
+  .setMaxIter(10)
+
+val pipeline = new Pipeline()
+  .setStages(Array(tokenizer, hashingTF, lr))
+
+// Fit the pipeline
+val model = pipeline.fit(trainingData)
+
+// Make predictions
+model.transform(testData)
+```
+
+### Transformers
+
+Algorithms that transform one DataFrame into another.
+
+**Examples:**
+- `Tokenizer`: Splits text into words
+- `HashingTF`: Maps word sequences to feature vectors
+- `StandardScaler`: Normalizes features
+- `VectorAssembler`: Combines multiple columns into a vector
+- `PCA`: Dimensionality reduction
+
+### Estimators
+
+Algorithms that fit on a DataFrame to produce a Transformer.
+
+**Examples:**
+- `LogisticRegression`: Produces LogisticRegressionModel
+- `DecisionTreeClassifier`: Produces DecisionTreeClassificationModel
+- `KMeans`: Produces KMeansModel
+- `StringIndexer`: Produces StringIndexerModel
+
+## ML Algorithms
+
+### Classification
+
+**Binary and Multiclass:**
+- Logistic Regression
+- Decision Tree Classifier
+- Random Forest Classifier
+- Gradient-Boosted Tree Classifier
+- Naive Bayes
+- Linear Support Vector Machine
+
+**Multilabel:**
+- OneVsRest
+
+**Example:**
+```scala
+import org.apache.spark.ml.classification.LogisticRegression
+
+val lr = new LogisticRegression()
+  .setMaxIter(10)
+  .setRegParam(0.3)
+  .setElasticNetParam(0.8)
+
+val model = lr.fit(trainingData)
+val predictions = model.transform(testData)
+```
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/classification/`
+
+### Regression
+
+- Linear Regression
+- Generalized Linear Regression
+- Decision Tree Regression
+- Random Forest Regression
+- Gradient-Boosted Tree Regression
+- Survival Regression (AFT)
+- Isotonic Regression
+
+**Example:**
+```scala
+import org.apache.spark.ml.regression.LinearRegression
+
+val lr = new LinearRegression()
+  .setMaxIter(10)
+  .setRegParam(0.3)
+  .setElasticNetParam(0.8)
+
+val model = lr.fit(trainingData)
+```
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/regression/`
+
+### Clustering
+
+- K-means
+- Latent Dirichlet Allocation (LDA)
+- Bisecting K-means
+- Gaussian Mixture Model (GMM)
+
+**Example:**
+```scala
+import org.apache.spark.ml.clustering.KMeans
+
+val kmeans = new KMeans()
+  .setK(3)
+  .setSeed(1L)
+
+val model = kmeans.fit(dataset)
+val predictions = model.transform(dataset)
+```
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/clustering/`
+
+### Collaborative Filtering
+
+Alternating Least Squares (ALS) for recommendation systems.
+
+**Example:**
+```scala
+import org.apache.spark.ml.recommendation.ALS
+
+val als = new ALS()
+  .setMaxIter(10)
+  .setRegParam(0.01)
+  .setUserCol("userId")
+  .setItemCol("movieId")
+  .setRatingCol("rating")
+
+val model = als.fit(ratings)
+val predictions = model.transform(testData)
+```
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/recommendation/`
+
+## Feature Engineering
+
+### Feature Extractors
+
+- `TF-IDF`: Text feature extraction
+- `Word2Vec`: Word embeddings
+- `CountVectorizer`: Converts text to vectors of token counts
+
+### Feature Transformers
+
+- `Tokenizer`: Text tokenization
+- `StopWordsRemover`: Removes stop words
+- `StringIndexer`: Encodes string labels to indices
+- `IndexToString`: Converts indices back to strings
+- `OneHotEncoder`: One-hot encoding
+- `VectorAssembler`: Combines columns into feature vector
+- `StandardScaler`: Standardizes features
+- `MinMaxScaler`: Scales features to a range
+- `Normalizer`: Normalizes vectors to unit norm
+- `Binarizer`: Binarizes based on threshold
+
+### Feature Selectors
+
+- `VectorSlicer`: Extracts subset of features
+- `RFormula`: R model formula for feature specification
+- `ChiSqSelector`: Chi-square feature selection
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/feature/`
+
+## Model Selection and Tuning
+
+### Cross-Validation
+
+```scala
+import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
+import org.apache.spark.ml.evaluation.RegressionEvaluator
+
+val paramGrid = new ParamGridBuilder()
+  .addGrid(lr.regParam, Array(0.1, 0.01))
+  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
+  .build()
+
+val cv = new CrossValidator()
+  .setEstimator(lr)
+  .setEvaluator(new RegressionEvaluator())
+  .setEstimatorParamMaps(paramGrid)
+  .setNumFolds(3)
+
+val cvModel = cv.fit(trainingData)
+```
+
+### Train-Validation Split
+
+```scala
+import org.apache.spark.ml.tuning.TrainValidationSplit
+
+val trainValidationSplit = new TrainValidationSplit()
+  .setEstimator(lr)
+  .setEvaluator(new RegressionEvaluator())
+  .setEstimatorParamMaps(paramGrid)
+  .setTrainRatio(0.8)
+
+val model = trainValidationSplit.fit(trainingData)
+```
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/tuning/`
+
+## Evaluation Metrics
+
+### Classification
+
+```scala
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+
+val evaluator = new MulticlassClassificationEvaluator()
+  .setLabelCol("label")
+  .setPredictionCol("prediction")
+  .setMetricName("accuracy")
+
+val accuracy = evaluator.evaluate(predictions)
+```
+
+### Regression
+
+```scala
+import org.apache.spark.ml.evaluation.RegressionEvaluator
+
+val evaluator = new RegressionEvaluator()
+  .setLabelCol("label")
+  .setPredictionCol("prediction")
+  .setMetricName("rmse")
+
+val rmse = evaluator.evaluate(predictions)
+```
+
+**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/evaluation/`
+
+## Linear Algebra
+
+MLlib provides distributed linear algebra through Breeze.
+
+**Location**: `src/main/scala/org/apache/spark/mllib/linalg/`

Review Comment:
   This section mixes package paths: the stated location points to mllib/linalg (RDD-based), while the example imports org.apache.spark.ml.linalg (DataFrame-based, located under mllib-local). Please clarify by either updating the location to mllib-local/src/main/scala/org/apache/spark/ml/linalg for ml.linalg types, or change the example to import org.apache.spark.mllib.linalg if you intend to reference the RDD-based API.
   ```suggestion
   **Location**: `mllib-local/src/main/scala/org/apache/spark/ml/linalg/`
   ```
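   For illustration, a small sketch contrasting the two packages (the vector values are arbitrary):

   ```scala
   // DataFrame-based API: org.apache.spark.ml.linalg (mllib-local module)
   import org.apache.spark.ml.linalg.Vectors
   val v = Vectors.dense(1.0, 0.0, 3.0)

   // RDD-based API (maintenance mode): org.apache.spark.mllib.linalg (mllib module)
   import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
   val ov = OldVectors.dense(1.0, 0.0, 3.0)
   ```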



##########
sbin/README.md:
##########
@@ -0,0 +1,514 @@
+# Spark Admin Scripts
+
+This directory contains administrative scripts for managing Spark standalone clusters.
+
+## Overview
+
+The `sbin/` scripts are used by cluster administrators to:
+- Start and stop Spark standalone clusters
+- Start and stop individual daemons (master, workers, history server)
+- Manage cluster lifecycle
+- Configure cluster nodes
+
+**Note**: These scripts are for **Spark Standalone** cluster mode only. For YARN, Kubernetes, or Mesos, use their respective cluster management tools.
+
+## Cluster Management Scripts
+
+### start-all.sh / stop-all.sh
+
+Start or stop all Spark daemons on the cluster.
+
+**Usage:**
+```bash
+# Start master and all workers
+./sbin/start-all.sh
+
+# Stop all daemons
+./sbin/stop-all.sh
+```
+
+**What they do:**
+- `start-all.sh`: Starts master on the current machine and workers on machines listed in `conf/workers`
+- `stop-all.sh`: Stops all master and worker daemons
+
+**Prerequisites:**
+- SSH key-based authentication configured
+- `conf/workers` file with worker hostnames
+- Spark installed at same location on all machines
+
+**Configuration files:**
+- `conf/workers`: List of worker hostnames (one per line)
+- `conf/spark-env.sh`: Environment variables
+
+### start-master.sh / stop-master.sh
+
+Start or stop the Spark master daemon on the current machine.
+
+**Usage:**
+```bash
+# Start master
+./sbin/start-master.sh
+
+# Stop master
+./sbin/stop-master.sh
+```
+
+**Master Web UI**: Access at `http://<master-host>:8080/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_MASTER_HOST=master-hostname
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+```
+
+### start-worker.sh / stop-worker.sh
+
+Start or stop a Spark worker daemon on the current machine.
+
+**Usage:**
+```bash
+# Start worker connecting to master
+./sbin/start-worker.sh spark://master:7077
+
+# Stop worker
+./sbin/stop-worker.sh
+```
+
+**Worker Web UI**: Access at `http://<worker-host>:8081/`
+
+**Configuration:**
+```bash
+# In conf/spark-env.sh
+export SPARK_WORKER_CORES=8      # Number of cores to use
+export SPARK_WORKER_MEMORY=16g   # Memory to allocate
+export SPARK_WORKER_PORT=7078    # Worker port
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work  # Work directory
+```
+
+### start-workers.sh / stop-workers.sh
+
+Start or stop workers on all machines listed in `conf/workers`.
+
+**Usage:**
+```bash
+# Start all workers
+./sbin/start-workers.sh spark://master:7077
+
+# Stop all workers
+./sbin/stop-workers.sh
+```
+
+**Requirements:**
+- `conf/workers` file configured
+- SSH access to all worker machines
+- Master URL (for starting)
+
+## History Server Scripts
+
+### start-history-server.sh / stop-history-server.sh
+
+Start or stop the Spark History Server for viewing completed application logs.
+
+**Usage:**
+```bash
+# Start history server
+./sbin/start-history-server.sh
+
+# Stop history server
+./sbin/stop-history-server.sh
+```
+
+**History Server UI**: Access at `http://<host>:18080/`
+
+**Configuration:**
+```properties
+# In conf/spark-defaults.conf
+spark.history.fs.logDirectory=hdfs://namenode/spark-logs
+spark.history.ui.port=18080
+spark.eventLog.enabled=true
+spark.eventLog.dir=hdfs://namenode/spark-logs
+```
+
+**Requirements:**
+- Applications must have event logging enabled
+- Log directory must be accessible
+
+## Shuffle Service Scripts
+
+### start-shuffle-service.sh / stop-shuffle-service.sh
+
+Start or stop the external shuffle service (for YARN).
+
+**Usage:**
+```bash
+# Start shuffle service
+./sbin/start-shuffle-service.sh
+
+# Stop shuffle service
+./sbin/stop-shuffle-service.sh
+```
+
+**Note**: Typically used only when running on YARN without the YARN auxiliary service.
+
+## Configuration Files
+
+### conf/workers
+
+Lists worker hostnames, one per line.
+
+**Example:**
+```
+worker1.example.com
+worker2.example.com
+worker3.example.com
+```
+
+**Usage:**
+- Used by `start-all.sh` and `start-workers.sh`
+- Each line should contain a hostname or IP address
+- Blank lines and lines starting with `#` are ignored
+
+### conf/spark-env.sh
+
+Environment variables for Spark daemons.
+
+**Example:**
+```bash
+#!/usr/bin/env bash
+
+# Java
+export JAVA_HOME=/usr/lib/jvm/java-17
+
+# Master settings
+export SPARK_MASTER_HOST=master.example.com
+export SPARK_MASTER_PORT=7077
+export SPARK_MASTER_WEBUI_PORT=8080
+
+# Worker settings
+export SPARK_WORKER_CORES=8
+export SPARK_WORKER_MEMORY=16g
+export SPARK_WORKER_PORT=7078
+export SPARK_WORKER_WEBUI_PORT=8081
+export SPARK_WORKER_DIR=/var/spark/work
+
+# Directories
+export SPARK_LOG_DIR=/var/log/spark
+export SPARK_PID_DIR=/var/run/spark
+
+# History Server
+export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/spark-logs"
+
+# Additional Java options
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
+```
+
+**Key Variables:**
+
+**Master:**
+- `SPARK_MASTER_HOST`: Master hostname
+- `SPARK_MASTER_PORT`: Master port (default: 7077)
+- `SPARK_MASTER_WEBUI_PORT`: Web UI port (default: 8080)
+
+**Worker:**
+- `SPARK_WORKER_CORES`: Number of cores per worker
+- `SPARK_WORKER_MEMORY`: Memory per worker (e.g., 16g)
+- `SPARK_WORKER_PORT`: Worker communication port
+- `SPARK_WORKER_WEBUI_PORT`: Worker web UI port (default: 8081)
+- `SPARK_WORKER_DIR`: Directory for scratch space and logs
+- `SPARK_WORKER_INSTANCES`: Number of worker instances per machine
+
+**General** (illustrative values sketched below):
+- `SPARK_LOG_DIR`: Directory for daemon logs
+- `SPARK_PID_DIR`: Directory for PID files
+- `SPARK_IDENT_STRING`: Identifier for daemons (default: username)
+- `SPARK_NICENESS`: Nice value for daemons
+- `SPARK_DAEMON_MEMORY`: Memory for daemon processes
+
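+As a minimal sketch, the general variables above could be set in `conf/spark-env.sh` like this (all values are illustrative):
+
+```bash
+# Illustrative values for the general daemon settings
+export SPARK_LOG_DIR=/var/log/spark     # daemon log files
+export SPARK_PID_DIR=/var/run/spark     # daemon PID files
+export SPARK_IDENT_STRING=spark         # identifier used in log/PID file names
+export SPARK_NICENESS=5                 # run daemons at a lower scheduling priority
+export SPARK_DAEMON_MEMORY=2g           # heap size for master/worker/history server
+```
+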
+## Setting Up a Standalone Cluster
+
+### Step 1: Install Spark on All Nodes
+
+```bash
+# Download and extract Spark on each machine
+tar xzf spark-X.Y.Z-bin-hadoopX.tgz
+cd spark-X.Y.Z-bin-hadoopX
+```
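+
+One hedged way to push the same build to every worker (the hostnames, remote user, and install path are placeholders to adapt):
+
+```bash
+# Copy and unpack the same distribution on each worker
+for host in worker1 worker2 worker3; do
+  scp spark-X.Y.Z-bin-hadoopX.tgz "user@${host}:"
+  ssh "$host" "tar xzf spark-X.Y.Z-bin-hadoopX.tgz"
+done
+```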
+
+### Step 2: Configure spark-env.sh
+
+Create `conf/spark-env.sh` from template:
+```bash
+cp conf/spark-env.sh.template conf/spark-env.sh
+# Edit conf/spark-env.sh with appropriate settings
+```
+
+### Step 3: Configure Workers File
+
+Create `conf/workers`:
+```bash
+cp conf/workers.template conf/workers
+# Add worker hostnames, one per line
+```
+
+### Step 4: Configure SSH Access
+
+Set up password-less SSH from master to all workers:
+```bash
+ssh-keygen -t rsa
+ssh-copy-id user@worker1
+ssh-copy-id user@worker2
+# ... for each worker
+```
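+
+A quick sanity check that password-less SSH works for every host in `conf/workers` (a sketch; assumes the workers file from Step 3 is populated):
+
+```bash
+# Fail fast if any worker still prompts for a password
+for host in $(grep -vE '^[[:space:]]*(#|$)' conf/workers); do
+  ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
+    && echo "$host: OK" || echo "$host: SSH FAILED"
+done
+```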
+
+### Step 5: Synchronize Configuration
+
+Copy configuration to all workers:
+```bash
+# Skip blank lines and comments in conf/workers
+grep -vE '^[[:space:]]*(#|$)' conf/workers | while read -r host; do
+  rsync -av conf/ "user@${host}:spark/conf/"
+done
+```
+
+### Step 6: Start the Cluster
+
+```bash
+./sbin/start-all.sh
+```
+
+### Step 7: Verify
+
+- Check master UI: `http://master:8080`
+- Check worker UIs: `http://worker1:8081`, etc.
+- Look for workers registered with the master (a scripted check is sketched below)
+
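+One way to script this check is the master web UI's JSON endpoint (a sketch; the exact fields returned can vary by Spark version):
+
+```bash
+# Dump cluster state (workers, applications, status) from the master UI
+curl -s http://master:8080/json | python3 -m json.tool
+```
+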
+## High Availability
+
+For production deployments, configure high availability with ZooKeeper.
+
+### ZooKeeper-based HA Configuration
+
+**In conf/spark-env.sh:**
+```bash
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
+  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
+  -Dspark.deploy.zookeeper.dir=/spark"
+```
+
+### Start Multiple Masters
+
+```bash
+# On master1
+./sbin/start-master.sh
+
+# On master2
+./sbin/start-master.sh
+
+# On master3
+./sbin/start-master.sh
+```
+
+### Connect Workers to All Masters
+
+```bash
+./sbin/start-worker.sh spark://master1:7077,master2:7077,master3:7077
+```
+
+**Automatic failover:** If the active master fails, the standby masters detect the failure and one of them becomes active.
+
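+To see which master is currently active, one hedged approach is to query each master's JSON endpoint and look at the reported status (field names and values may differ across versions):
+
+```bash
+# Print each master's reported status (e.g. ALIVE or STANDBY)
+for m in master1 master2 master3; do
+  echo -n "$m: "
+  curl -s "http://$m:8080/json" | grep -o '"status" *: *"[A-Z]*"' || echo "no status reported"
+done
+```
+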
+## Monitoring and Logs
+
+### Log Files
+
+Daemon logs are written to `$SPARK_LOG_DIR` (default: `logs/`):
+
+```bash
+# Master log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.master.Master-*.out
+
+# Worker log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.worker.Worker-*.out
+
+# History Server log
+$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.history.HistoryServer-*.out
+```
+
+### View Logs
+
+```bash
+# Tail the master log
+tail -f logs/spark-*-org.apache.spark.deploy.master.Master-*.out
+
+# Tail a worker log
+tail -f logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out
+
+# Search the master log for errors
+grep ERROR logs/spark-*-org.apache.spark.deploy.master.Master-*.out
+```
+
+### Web UIs
+
+- **Master UI**: `http://<master>:8080` - Cluster status, workers, applications
+- **Worker UI**: `http://<worker>:8081` - Worker status, running executors
+- **Application UI**: `http://<driver>:4040` - Running application metrics
+- **History Server**: `http://<history-server>:18080` - Completed applications
+
+## Advanced Configuration
+
+### Daemon Memory
+
+Set the heap size used by the master, worker, and history server daemon processes themselves (default: 1g):
+```bash
+export SPARK_DAEMON_MEMORY=2g
+```
+
+### Multiple Workers per Machine
+
+Run multiple worker instances on a single machine:
+```bash
+export SPARK_WORKER_INSTANCES=2
+export SPARK_WORKER_CORES=4      # Cores per instance
+export SPARK_WORKER_MEMORY=8g    # Memory per instance
+```
+
+### Work Directory
+
+Change worker scratch space:
+```bash
+export SPARK_WORKER_DIR=/mnt/fast-disk/spark-work
+```
+
+### Port Configuration
+
+Use non-default ports:
+```bash
+export SPARK_MASTER_PORT=9077
+export SPARK_MASTER_WEBUI_PORT=9080
+export SPARK_WORKER_PORT=9078
+export SPARK_WORKER_WEBUI_PORT=9081
+```
+
+## Security
+
+### Enable Authentication
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.authenticate=true \
+  -Dspark.authenticate.secret=your-secret-key"
+```
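+
+A shared secret can be generated with standard tooling; for example (assuming `openssl` is available):
+
+```bash
+# Generate a random 256-bit value to use as spark.authenticate.secret
+openssl rand -hex 32
+```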
+
+### Enable SSL
+
+```bash
+export SPARK_DAEMON_JAVA_OPTS="-Dspark.ssl.enabled=true \
+  -Dspark.ssl.keyStore=/path/to/keystore \
+  -Dspark.ssl.keyStorePassword=password \
+  -Dspark.ssl.trustStore=/path/to/truststore \
+  -Dspark.ssl.trustStorePassword=password"
+```
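+
+If no keystore exists yet, one way to create a self-signed one for testing (a sketch; the paths, passwords, and CN are placeholders) is:
+
+```bash
+# Create a self-signed keystore for testing purposes only
+keytool -genkeypair -alias spark -keyalg RSA -keysize 2048 -validity 365 \
+  -keystore /path/to/keystore -storepass password \
+  -dname "CN=master.example.com"
+```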
+
+## Troubleshooting
+
+### Master Won't Start
+
+**Check:**
+1. Port 7077 is not already in use by another process: `netstat -an | grep 7077`
+2. Hostname is resolvable: `ping $SPARK_MASTER_HOST`
+3. Logs for errors: `cat logs/spark-*-org.apache.spark.deploy.master.Master-*.out`
+
+### Workers Not Connecting
+
+**Check:**
+1. Master URL is correct
+2. Network connectivity: `telnet master 7077` (or test from every worker with the loop below)
+3. Firewall allows connections
+4. Worker logs: `cat logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out`
+
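+A hedged way to test connectivity from every worker back to the master (uses bash's `/dev/tcp`; the `master` hostname is a placeholder):
+
+```bash
+# From the master, ask each worker whether it can reach master:7077
+for host in $(grep -vE '^[[:space:]]*(#|$)' conf/workers); do
+  echo -n "$host: "
+  ssh "$host" 'timeout 5 bash -c "</dev/tcp/master/7077" && echo reachable || echo BLOCKED'
+done
+```
+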
+### SSH Connection Issues
+
+**Solutions:**
+1. Verify SSH key: `ssh worker1 echo test`
+2. Check SSH config: `~/.ssh/config`
+3. Use SSH agent: `eval $(ssh-agent); ssh-add`
+
+### Insufficient Resources
+
+**Check:**
+- Worker has enough memory: `free -h`
+- Enough cores available: `nproc`
+- Disk space: `df -h`
+
+## Cluster Shutdown
+
+### Graceful Shutdown
+
+```bash
+# Stop all workers first
+./sbin/stop-workers.sh
+
+# Stop master
+./sbin/stop-master.sh
+
+# Or stop everything
+./sbin/stop-all.sh
+```
+
+### Check All Stopped
+
+```bash
+# Check for running Java processes
+jps | grep -E "(Master|Worker)"
+```
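+
+To check every node rather than just the current machine (a sketch; assumes `jps` is on the workers' PATH):
+
+```bash
+# Look for leftover Spark daemons on every worker
+for host in $(grep -vE '^[[:space:]]*(#|$)' conf/workers); do
+  echo "== $host =="
+  ssh "$host" 'jps | grep -E "Master|Worker" || echo "no Spark daemons running"'
+done
+```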
+
+### Force Kill if Needed
+
+```bash
+# Kill any remaining Spark processes
+pkill -f org.apache.spark.deploy
+```
+
+## Best Practices
+
+1. **Use HA in production**: Configure ZooKeeper-based HA
+2. **Monitor resources**: Watch CPU, memory, disk usage
+3. **Separate log directories**: Use dedicated disk for logs
+4. **Regular maintenance**: Clean old logs and application data
+5. **Automate startup**: Use systemd or init scripts (see the sketch after this list)
+6. **Configure limits**: Set file descriptor and process limits
+7. **Use external shuffle service**: For better fault tolerance
+8. **Back up metadata**: Regularly back up ZooKeeper data
+
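+For automated startup (best practice 5), a minimal systemd sketch is shown below; the `/opt/spark` install path and the `spark` service user are assumptions to adapt:
+
+```bash
+# Hypothetical systemd unit for the standalone master (adjust paths and user)
+sudo tee /etc/systemd/system/spark-master.service >/dev/null <<'EOF'
+[Unit]
+Description=Apache Spark Standalone Master
+After=network-online.target
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+User=spark
+ExecStart=/opt/spark/sbin/start-master.sh
+ExecStop=/opt/spark/sbin/stop-master.sh
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+sudo systemctl daemon-reload
+sudo systemctl enable --now spark-master.service
+```
+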
+## Scripts Reference
+
+| Script | Purpose |
+|--------|---------|
+| `start-all.sh` | Start master and all workers |
+| `stop-all.sh` | Stop master and all workers |
+| `start-master.sh` | Start master on current machine |
+| `stop-master.sh` | Stop master |
+| `start-worker.sh` | Start worker on current machine |
+| `stop-worker.sh` | Stop worker |
+| `start-workers.sh` | Start workers on all machines in `conf/workers` |
+| `stop-workers.sh` | Stop all workers |
+| `start-history-server.sh` | Start history server |
+| `stop-history-server.sh` | Stop history server |

Review Comment:
   The markdown table is malformed due to double leading pipes and an incorrect separator row, which will render improperly. Replace with a standard markdown table format as shown below.



##########
sbin/README.md:
##########
@@ -0,0 +1,514 @@

Review Comment:
   Replace the above with:


