AdityaPandey2612 opened a new pull request, #2434:
URL: https://github.com/apache/systemds/pull/2434

   # SystemDS: Parameter Server Autoencoder with Variable Hidden Layers
   (Already rebased onto the main branch of SystemDS.)
   
   ## Overview
   
   This repository contains a comprehensive implementation and experimental 
evaluation of distributed autoencoder training using Apache SystemDS. The 
project implements a generalized symmetric autoencoder in DML (Declarative 
Machine Learning) and provides a complete infrastructure for automated testing, 
validation, and performance analysis of parameter server-based distributed 
training.
   
   #  Key Contributions
   
   ## 1. Generalized Autoencoder Implementation (DML)
   - **`autoencoder_2layer.dml`** (867 lines): Core implementation supporting 
both DEFAULTSERVER (single-node) and PARAMSERVER (distributed) training modes
   - **`autoencoderGeneralized.dml`** (130 lines): Wrapper enabling arbitrary 
encoder depths with symmetric decoder mirroring
   - **`autoGradientCheck.dml`** (95 lines): Finite-difference gradient 
verification for correctness validation
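
   For readers unfamiliar with finite-difference gradient checking, the sketch below shows the general recipe in NumPy on a toy quadratic loss. It only illustrates the technique; it is not the DML in `autoGradientCheck.dml`.

```python
import numpy as np

def numerical_grad(loss_fn, W, eps=1e-5):
    """Central finite-difference gradient of loss_fn with respect to matrix W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        f_plus = loss_fn(W)
        W[idx] = orig - eps
        f_minus = loss_fn(W)
        W[idx] = orig
        grad[idx] = (f_plus - f_minus) / (2.0 * eps)
    return grad

# Toy check on a quadratic loss L(W) = 0.5 * ||X W||_F^2 with known gradient X^T X W.
rng = np.random.default_rng(42)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(8, 4))
loss = lambda W: 0.5 * np.sum((X @ W) ** 2)
g_analytic = X.T @ (X @ W)
g_numeric = numerical_grad(loss, W)

# Relative error per entry: |analytic - numeric| / max(|analytic|, |numeric|, tiny)
rel_err = np.abs(g_analytic - g_numeric) / np.maximum(
    np.maximum(np.abs(g_analytic), np.abs(g_numeric)), 1e-12)
print("max relative error:", rel_err.max())  # typically on the order of 1e-9 to 1e-7
```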
   
   ## 2. Automated Testing Suite
   
   ### JUnit Integration Tests
   
   The implementation includes comprehensive JUnit tests integrated with the 
SystemDS test framework:
   
   #### **BuiltinAutoencoderGeneralizedTest.java**
   Complete test suite covering multiple architectural configurations:
   
   -  **testAutoencoderThreeLayerOutputs**: Validates 3-layer encoder (16→8→4) 
architecture
   -  **testAutoencoderTwoLayerOutputs**: Tests 2-layer encoder (16→8) with 
automatic bottleneck detection
   -  **testAutoencoderSingleLayerOutputs**: Single-layer encoder (16) edge case
   -  **testAutoencoderSparseInputOutputs**: Sparse data handling (20% density) 
with deeper network (32→16→8)
   -  **testAutoencoderParamservOutputs**: PARAMSERVER mode validation
   
   **Test Coverage:**
   - Matrix dimension verification (weights, hidden representations)
   - Output consistency checks (W1, Wlast, hidden layer)
   - Sparse and dense input matrices
   - Multiple encoder depths (1-3 layers)
   - Both DEFAULTSERVER and PARAMSERVER training modes
   
   #### **BuiltinAutoencoderGeneralizedBasicTest.java**
   Basic sanity test for quick validation:
   -  **testAutoencoderThreeLayerOutputs**: Smoke test for standard 3-layer 
configuration
   
   
   **Key Features of the Generalized Autoencoder Implementation:**
   - Arbitrary encoder depth with automatic decoder construction
   - Glorot weight initialization for deep networks
   - Support for multiple activation functions (tanh, sigmoid, ReLU)
   - Integration with SystemDS parameter server framework
   - Z-score normalization and random data shuffling
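
   As a quick reference, here is a minimal NumPy sketch of the Glorot initialization and z-score normalization listed above; it only illustrates the two ideas and is not the DML implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def zscore(X, eps=1e-8):
    """Column-wise z-score normalization: zero mean, unit variance per feature."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

# Example: weights for a 64 -> 16 encoder layer on normalized input.
W1 = glorot_uniform(64, 16)
X = zscore(rng.normal(loc=3.0, scale=2.0, size=(1000, 64)))
print(W1.shape, round(float(X.mean()), 6), round(float(X.std()), 3))
```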
   
   # Experiment, Correctness, and Testing Infrastructure
   
   **Automated Experiment Runner:**
   - **`run_sysds_experiments.py`**: Python-based automation framework for 
executing experiment sweeps
     - YAML-based configuration management
     - Grid parameter expansion (see the sketch after this list)
     - Multi-repeat execution for statistical analysis
     - Automatic CSV result generation with detailed metrics
     - Progress tracking and error handling
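
   The actual automation logic lives in `run_sysds_experiments.py`; as a rough, hedged illustration of how a YAML grid can be expanded into individual runs (the parameter names and YAML layout below are hypothetical), the core of such a runner is a Cartesian product over the grid values:

```python
import itertools

import yaml  # pip install pyyaml

def expand_grid(grid):
    """Expand {param: [values, ...]} into a list of concrete run configurations."""
    keys = list(grid)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(grid[k] for k in keys))]

# Hypothetical grid, inlined as a YAML string for the example.
cfg = yaml.safe_load("""
grid:
  WORKERS: [2, 4]
  K: [1, 2, 3]
  MODELAVG: [true, false]
""")

for run in expand_grid(cfg["grid"]):
    print(run)  # one dict per experiment, e.g. {'WORKERS': 2, 'K': 1, 'MODELAVG': True}
```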
   
   **Configuration Files (8 YAML configs):**
   - `e16_default.yaml`: DEFAULTSERVER baseline experiments
   - `e16_ps_w2.yaml`: 2-worker parameter server configurations
   - `e16_ps_w4.yaml`: 4-worker parameter server with K-parameter sweep
   - `epoch_curve.yaml`: Convergence analysis over training epochs
   - `epoch_curve_sbp.yaml`: SBP staleness parameter exploration
   - `gradient_check.yaml`: Gradient verification test suite
   - `stress_suite.yaml`: Comprehensive stress testing (30+ configurations)
   - `epoch_curve_fast.yaml`: Quick validation experiments
   
   ##  Experimental Results Summary
   
   ### Correctness Validation 
   - **Gradient Checking**: Achieved relative errors of **10⁻⁹ to 10⁻⁷** across 
all layers
   - **Sanity Tests**: Passed no-op updates, tiny overfit, and W=1 equivalence 
checks
   - **Numerical Stability**: No gradient explosion or NaN propagation across 
700+ runs
   
   ### Convergence Performance 
   
   **Best Configuration:** SBP(K=2, W=4, ModelAvg=True)
   - **Final objective:** 8,801.7 (reconstruction error)
   - **Improvement over baseline:** 74.5 points (0.8% better than DEFAULTSERVER)
   - **Runtime overhead:** +0.28 seconds (a 17% increase over the 1.68 s baseline)
   
   **Key Findings:**
   | Configuration | Workers | Final Obj | Wall Time | Improvement |
   |--------------|---------|-----------|-----------|-------------|
   | DEFAULTSERVER | 1 | 8,876.2 | 1.68s | baseline |
   | PS BSP | 2 | 8,887.8 | 1.83s | -11.6 |
   | PS BSP | 4 | 8,848.6 | 1.85s | +27.6 |
   | PS SBP(K=3) | 4 | 8,858.8 | 1.83s | +17.4 |
   | **PS SBP(K=2)** | **4** | **8,801.7** | **1.90s** | **+74.5**  |
   
   ### Key Insights 
   
   1. **SBP Outperforms BSP**: Relaxed synchronization (K=2 out of 4 workers) 
achieves superior convergence compared to strict BSP
   2. **K Parameter Matters**: SBP(K=2) clearly outperforms SBP(K=3) (final objective 8,801.7 vs 8,858.8), suggesting a sweet spot in staleness tolerance
   3. **Model Averaging**: Provides consistent 5-7 point improvement for W=4 
configurations
   4. **Modest Overhead**: Distributed training adds only 0.15-0.28 seconds for 
moderate datasets (32,768 rows)
   5. **Scalability**: Execution time dominates compilation overhead for 
datasets >10K rows
   
   ##  Repository Structure
   ```
   LDE_Experiments/
   ├── src/test/scripts/functions/builtin/
   │   ├── autoencoder_2layer.dml           # Main implementation (867 lines)
   │   ├── autoencoderGeneralized.dml       # Generalized wrapper (130 lines)
   │   └── autoGradientCheck.dml            # Gradient checking (95 lines)
   ├── configs/
   │   ├── e16_default.yaml                 # Baseline experiments
   │   ├── e16_ps_w2.yaml                   # 2-worker configs
   │   ├── e16_ps_w4.yaml                   # 4-worker configs
   │   ├── epoch_curve.yaml                 # Convergence analysis
   │   ├── epoch_curve_sbp.yaml             # SBP parameter sweep
   │   ├── gradient_check.yaml              # Gradient verification
   │   ├── stress_suite.yaml                # Comprehensive testing
   │   └── epoch_curve_fast.yaml            # Quick validation
   ├── run_sysds_experiments.py             # Experiment automation
   ├── results/
   │   ├── results1.csv                     # Gradient checking (90 runs)
   │   ├── results2-4.csv                   # Early experiments
   │   ├── results8-11.csv                  # Scaling/stress tests
   │   ├── results12.csv                    # DEFAULTSERVER (5 runs)
   │   ├── results13.csv                    # PARAMSERVER W=2 (20 runs)
   │   └── results14.csv                    # PARAMSERVER W=4 (40 runs)
   ├── figures/                             # Generated visualizations
   │   ├── gradient_check_errors.png
   │   ├── gradient_check_by_layer.png
   │   ├── epoch16_comparison.png
   │   ├── bsp_vs_sbp_comparison.png
   │   ├── model_averaging_impact.png
   │   ├── runtime_breakdown.png
   │   ├── variance_analysis.png
   │   ├── performance_accuracy_tradeoff.png
   │   ├── convergence_w2.png
   │   ├── convergence_w4.png
   │   └── scaling_analysis.png
   ├── report_comprehensive.pdf             # Full technical report (38 pages)
   ├── report_comprehensive.tex             # LaTeX source
   └── README.md                            # This file
   ```
   
   ##  Quick Start
   
   ### Prerequisites
   
   - **Apache SystemDS** 3.0.0+ ([installation 
guide](https://github.com/apache/systemds))
   - **Java** 11 or higher
   - **Python** 3.8+ with packages: `pyyaml`, `pandas`, `matplotlib`, 
`seaborn`, `numpy`
   
   ### Installation
   ```bash
   # Clone repository
   git clone https://github.com/AdityaPandey2612/LDE_Experiments.git
   cd LDE_Experiments
   
   # Install Python dependencies
   pip install pyyaml pandas matplotlib seaborn numpy
   
   # Update SystemDS path in runner script
   # Edit run_sysds_experiments.py, line ~25:
   # SYSTEMDS_ROOT = "/path/to/your/SystemDS"  # UPDATE THIS
   ```
   
   ### Generate Data
   ```bash
   # Create data directory
   mkdir -p data
   
   # Generate 32,768 x 64 random training data
   systemds -f - << 'EOF'
   X = rand(rows=32768, cols=64, min=0, max=1, pdf="uniform");
   write(X, "data/X.bin", format="binary");
   print("Data generated successfully");
   EOF
   ```
   
   ### Run Experiments
   
   **Single configuration:**
   ```bash
   python run_sysds_experiments.py \
     --yaml configs/e16_ps_w4.yaml \
     --stage epoch_curve \
     --repeats 5 \
     --output results/results14.csv
   ```
   
   **Full experiment suite:**
   ```bash
   # Run all configurations
   for config in configs/*.yaml; do
     basename=$(basename $config .yaml)
     python run_sysds_experiments.py \
       --yaml $config \
       --repeats 5 \
       --output results/${basename}.csv
   done
   ```
   
   **Manual execution (for debugging):**
   ```bash
   systemds src/test/scripts/functions/builtin/autoencoderGeneralized.dml \
     -exec singlenode -stats -nvargs \
     X=data/X.bin H1=16 H2=8 H3=4 \
     EPOCH=16 BATCH=256 STEP=1e-4 DECAY=1.0 MOMENTUM=0.0 \
     FULL_OBJ=TRUE METHOD=PARAMSERVER MODE=LOCAL UTYPE=SBP \
     FREQ=EPOCH WORKERS=4 K=2 SCHEME=DISJOINT_RANDOM \
     NBATCHES=0 MODELAVG=TRUE \
     W1_out=W1.bin Wlast_out=Wlast.bin hidden_out=hidden.bin
   ```
   
   ### Analyze Results
   ```bash
   # Generate visualizations and statistics
   python analyze_results.py
   
   # Outputs:
   # - All PNG figures in current directory
   # - summary_statistics.txt
   ```
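
   The following pandas sketch shows the kind of aggregation `analyze_results.py` performs over the result CSVs; the column names (`config`, `final_obj`, `wall_time_s`) are placeholders and may differ from the headers the runner actually writes.

```python
import pandas as pd

# Placeholder column names; adjust to the headers actually emitted by the runner.
df = pd.read_csv("results/results14.csv")
summary = (df.groupby("config")[["final_obj", "wall_time_s"]]
             .agg(["mean", "std"])
             .round(3))

print(summary)
with open("summary_statistics.txt", "w") as f:
    f.write(summary.to_string())
```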
   
   ## Visualizations
   
   The analysis pipeline generates 11 publication-ready figures:
   
   1. **Gradient Check Errors**: Scatter plot of relative errors across all 
layers
   2. **Gradient Check by Layer**: Box plot showing error distribution per layer
   3. **EPOCH=16 Comparison**: Bar charts comparing final objective and wall 
time
   4. **BSP vs SBP Comparison**: Direct comparison for W=2 and W=4
   5. **Model Averaging Impact**: Effect of ModelAvg on convergence
   6. **Convergence Curves**: Before/after objective for W=2 and W=4
   7. **Scaling Analysis**: Runtime and convergence vs dataset size
   8. **Runtime Breakdown**: Compilation vs execution time
   9. **Variance Analysis**: Standard deviation across repeats
   10. **Performance-Accuracy Trade-off**: Scatter plot of runtime vs 
convergence quality
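
   As an illustration of the figure style, the sketch below draws a comparison chart similar to `epoch16_comparison.png` using the EPOCH=16 table earlier in this description; it is not the code in the analysis script.

```python
import matplotlib.pyplot as plt

# Values taken from the EPOCH=16 summary table above.
configs   = ["DEFAULTSERVER", "BSP W=2", "BSP W=4", "SBP(K=3) W=4", "SBP(K=2) W=4"]
final_obj = [8876.2, 8887.8, 8848.6, 8858.8, 8801.7]
wall_time = [1.68, 1.83, 1.85, 1.83, 1.90]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
ax1.bar(configs, final_obj, color="steelblue")
ax1.set_ylabel("Final objective (reconstruction error)")
ax1.set_ylim(8750, 8900)
ax1.tick_params(axis="x", rotation=30)

ax2.bar(configs, wall_time, color="darkorange")
ax2.set_ylabel("Wall time (s)")
ax2.tick_params(axis="x", rotation=30)

fig.tight_layout()
fig.savefig("epoch16_comparison_sketch.png", dpi=150)
```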
   
   ##  Experimental Details
   
   ### Model Architecture
   - **Input/Output:** 64 dimensions
   - **Encoder:** 64 → 16 → 8 → 4 (bottleneck)
   - **Decoder:** 4 → 8 → 16 → 64 (symmetric)
   - **Activation:** tanh (with derivative caching for efficient backprop)
   - **Loss:** Mean squared reconstruction error
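
   A compact NumPy sketch of one forward pass and the reconstruction loss for the architecture above (tanh hidden layers, linear output in this sketch); it is illustrative only and not the DML training code.

```python
import numpy as np

rng = np.random.default_rng(7)
dims = [64, 16, 8, 4, 8, 16, 64]  # encoder 64->16->8->4, symmetric decoder 4->8->16->64

# Glorot-style scaling for each weight matrix; biases start at zero.
Ws = [rng.normal(scale=np.sqrt(2.0 / (a + b)), size=(a, b))
      for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]

def forward(X):
    """tanh on hidden layers; the output layer is kept linear in this sketch."""
    H = X
    for k, (W, b) in enumerate(zip(Ws, bs)):
        Z = H @ W + b
        H = Z if k == len(Ws) - 1 else np.tanh(Z)
    return H

X = rng.normal(size=(256, 64))          # one batch of 256 rows
X_hat = forward(X)
mse = float(np.mean((X - X_hat) ** 2))  # mean squared reconstruction error
print("reconstruction MSE:", round(mse, 4))
```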
   
   ### Training Configuration
   - **Dataset:** 32,768 rows × 64 columns (Gaussian random)
   - **Batch size:** 256
   - **Learning rate:** 10⁻⁴
   - **Momentum:** 0.0
   - **Decay:** 1.0 (no decay)
   - **Epochs:** 16 (primary experiments)
   
   ### Synchronization Strategies Evaluated
   
   | Strategy | Description | Workers | K Parameter |
   |----------|-------------|---------|-------------|
   | DEFAULTSERVER | Single-node SGD | 1 | N/A |
   | BSP | Bulk Synchronous Parallel | 2, 4 | N/A |
   | SBP | Stale synchronous with Backup workers | 2, 4 | 1, 2, 3 |
   
   **SBP Parameter K:**
   - K = the number of worker gradients the server waits for before applying an update
   - The remaining (W-K) workers act as backups, providing straggler tolerance
   - K = W → equivalent to BSP
   - K < W → relaxes synchronization for faster updates
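
   To make the K semantics concrete, here is a small hedged Python mock of one synchronization round in which the server applies only the first K of W worker gradients; it mirrors the description above and is not the actual SystemDS scheduler.

```python
import random

def sbp_round(latencies, gradients, K):
    """Average the gradients of the K fastest workers; the other W-K act as backups."""
    order = sorted(range(len(latencies)), key=lambda w: latencies[w])
    used = order[:K]
    avg_grad = sum(gradients[w] for w in used) / K
    return used, avg_grad

random.seed(0)
W, K = 4, 2
latencies = [random.uniform(0.1, 1.0) for _ in range(W)]  # simulated per-worker delays
grads = [random.gauss(0.0, 1.0) for _ in range(W)]        # scalar stand-ins for gradients
used, g = sbp_round(latencies, grads, K)
print(f"used workers {used} (K={K} of W={W}); averaged gradient = {g:.3f}")
# With K == W every worker is waited for, which reduces to BSP.
```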
   
   ## Performance Metrics
   
   ### Convergence Quality
   - **Gradient accuracy:** 10⁻⁹ to 10⁻⁷ relative error
   - **Final reconstruction error:** 8,801.7 (best SBP configuration)
   - **Improvement range:** -11.6 to +74.5 points vs baseline
   - **Coefficient of variation:** 0.8-1.2% (stable across repeats)
   
   ### Runtime Performance
   - **DEFAULTSERVER:** 1.68 ± 0.03 seconds
   - **PARAMSERVER overhead:** +0.15 to +0.28 seconds
   - **Compilation time:** ~0.52 seconds (constant)
   - **Execution scaling:** O(N^1.1) for dataset size N
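
   The O(N^1.1) exponent is reported in the technical report; the sketch below shows the standard log-log fit one can use to estimate such an exponent. The (N, runtime) pairs here are made up purely for illustration.

```python
import numpy as np

# Purely illustrative (rows, execution seconds) pairs; substitute measured values.
N = np.array([4096, 8192, 16384, 32768], dtype=float)
t = np.array([0.20, 0.43, 0.92, 1.98])

# Fit t ~ c * N^alpha  <=>  log t = alpha * log N + log c
alpha, log_c = np.polyfit(np.log(N), np.log(t), deg=1)
print(f"estimated scaling exponent alpha ~ {alpha:.2f}")  # ~1.1 for these toy numbers
```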
   
   ### Statistical Analysis
   - **Total experimental runs:** 700+
   - **Configurations tested:** 50+
   - **Repeats per config:** 3-5
   - **CSV result files:** 11 (results1-results14)
   - **Total result size:** ~500KB
   
   ##  Technical Highlights
   
   ### DML Implementation Features
   - **Generalized architecture:** Supports arbitrary encoder depths via 
recursive construction
   - **Parameter server integration:** Native SystemDS `paramserv()` API usage
   - **Gradient computation:** Separate worker gradient function for 
distributed execution
   - **Aggregation function:** Server-side gradient aggregation and model update
   - **Momentum support:** Velocity accumulators maintained across iterations
   - **Glorot initialization:** Proper weight scaling for deep networks
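
   The division of labor between the worker-side gradient function and the server-side aggregation function is the core of the `paramserv()` integration; the following plain-Python mock (not DML and not the SystemDS API) only illustrates how per-worker gradients, averaging, and a momentum-style velocity fit together.

```python
import numpy as np

rng = np.random.default_rng(1)

def worker_gradient(model, X_shard):
    """Worker side: gradient of the mean squared reconstruction error on one data shard."""
    W = model["W"]
    E = X_shard @ W - X_shard  # toy linear reconstruction X_hat = X @ W
    return {"dW": 2.0 * X_shard.T @ E / len(X_shard)}

def aggregate(model, grads, lr=1e-4, momentum=0.0):
    """Server side: average worker gradients, update velocity, then update the model."""
    dW = np.mean([g["dW"] for g in grads], axis=0)
    model["vW"] = momentum * model["vW"] - lr * dW
    model["W"] = model["W"] + model["vW"]
    return model

# One synchronous round with 4 workers on disjoint shards.
X = rng.normal(size=(1024, 64))
model = {"W": rng.normal(scale=0.1, size=(64, 64)), "vW": np.zeros((64, 64))}
grads = [worker_gradient(model, shard) for shard in np.array_split(X, 4)]
model = aggregate(model, grads)
print("updated W norm:", round(float(np.linalg.norm(model["W"])), 4))
```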
   
   ### Infrastructure Features
   - **YAML configuration:** Declarative experiment definitions with grid 
expansion
   - **Automated execution:** Parallel experiment scheduling with progress 
tracking
   - **Error handling:** Timeout protection, retry logic, detailed error logging
   - **Metrics collection:** Comprehensive timing breakdown (compilation, 
execution, wall time)
   - **Result validation:** Automatic parsing of SystemDS statistics output (see the parsing sketch after this list)
   - **Reproducibility:** Fixed seeds, deterministic ordering, version tracking
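
   As a hedged sketch of the metrics collection, the snippet below runs a command with a timeout and pulls a timing breakdown out of its output; the label strings are assumptions about the `-stats` format and should be verified against your SystemDS build.

```python
import re
import subprocess

# Assumed label format in `-stats` output; adjust for your SystemDS version.
TIMING_RE = re.compile(r"Total (elapsed|compilation|execution) time:\s+([\d.]+)\s+sec")

def run_and_time(cmd, timeout=600):
    """Run a SystemDS command, capture stdout, and extract the timing breakdown."""
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout
    return {kind: float(sec) for kind, sec in TIMING_RE.findall(out)}

# Example invocation (paths and nvargs are placeholders):
# timings = run_and_time(["systemds", "script.dml", "-stats", "-nvargs", "X=data/X.bin"])
# -> e.g. {'compilation': 0.52, 'execution': 1.16, 'elapsed': 1.68}
```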
   
   ## Documentation
   
   ### Complete Technical Report
   The repository includes a comprehensive 38-page technical report 
(`report_comprehensive.pdf`) covering:
   
   - **Mathematical formulation:** Autoencoder architecture, loss function, 
backpropagation
   - **Implementation details:** Code walkthroughs with actual DML snippets
   - **Experimental methodology:** Configuration management, automation pipeline
   - **Correctness verification:** Gradient checking, sanity tests, numerical 
stability
   - **Convergence analysis:** Detailed comparison of synchronization strategies
   - **Scaling analysis:** Runtime and convergence vs dataset size and worker 
count
   - **Discussion:** Findings, limitations, when to use distributed training
   - **Reproducibility:** Step-by-step instructions, troubleshooting, 
verification checklist
   
   ### Key Sections
   1. Introduction and Research Objectives
   2. Model Architecture and Training Algorithms
   3. DML Implementation with Code Listings
   4. Experimental Infrastructure and YAML Configs
   5. Correctness Verification (Gradient Checking)
   6. Convergence Analysis (BSP vs SBP)
   7. Scaling Analysis and Overhead Breakdown
   8. Discussion and Future Work
   9. Complete Reproducibility Guide
   
   ##  Research Context
   
   This work was completed as part of the **Large-Scale Data Engineering** 
course at **Technische Universität Berlin**. The project demonstrates:
   
   - Scalable implementation of deep learning in declarative ML frameworks
   - Comprehensive experimental methodology for distributed systems evaluation
   - Trade-off analysis between convergence quality and synchronization overhead
   - Best practices for reproducible machine learning research
   
   ##  Future Work
   
   ### Algorithmic Extensions
   -  Implement fully asynchronous (ASP) parameter server mode
   -  Add adaptive K parameter based on worker latency distribution
   -  Integrate gradient compression (sparsification, quantization)
   -  Support local SGD (multiple local updates before sync)
   -  Implement adaptive learning rate methods (Adam, RMSprop)
   
   ### Infrastructure Enhancements
   -  Add checkpointing for true learning curves across epochs
   -  Implement distributed execution on multi-node Spark cluster
   -  Integrate Bayesian hyperparameter optimization
   -  Add real-time TensorBoard-style monitoring
   -  Support for convolutional and variational autoencoders
   
   ### Experimental Extensions
   -  Evaluate on real datasets (MNIST, CIFAR-10)
   -  Larger-scale experiments (1M+ samples, 1000+ dimensions)
   -  Benchmark against TensorFlow/PyTorch distributed training
   -  Heterogeneous worker environments (straggler simulation)
   -  Communication cost analysis in true distributed setting
   
   ##  Contributing

   Contributions are welcome in the following areas:
   - Additional synchronization strategies (SSP, Gossip, etc.)
   - Alternative architectures (VAE, DAE, CAE)
   - Real-world dataset experiments
   - Performance optimizations
   - Documentation improvements
   - Bug fixes and testing
   
   ## Author
   
   **Aditya Pandey**  
   Technische Universität Berlin  
   
   **For a more comprehensive understanding of the project, the experiments, and the documentation, please see the PDF below:**
   
[report_comprehensive.pdf](https://github.com/user-attachments/files/25348203/report_comprehensive.pdf)
   

