kubraaksux opened a new pull request, #2431:
URL: https://github.com/apache/systemds/pull/2431

   Adds an LLM benchmarking framework and a SystemDS JMLC backend for evaluating inference across OpenAI, Ollama, vLLM, and SystemDS on 5 workloads (math, reasoning, summarization, JSON extraction, embeddings).
   
   Extends the JMLC API with LLM inference support via a Py4J bridge. `Connection.java` manages the Python worker lifecycle, `PreparedScript.java` handles batch inference through `FrameBlock`, and the benchmark framework runs standardized workloads against all backends.
   
   ### Experimental results
   
   30 runs on an NVIDIA H100, 50 samples each, across 4 backends and 2 local models (Qwen 3B, Mistral 7B).
   
   **Accuracy (% correct):**
   
   | Backend | math | reasoning | summarization | json_extraction | embeddings |
   |---------|------|-----------|---------------|-----------------|------------|
   | openai (gpt-4.1-mini) | 88% | 70% | 88% | 84% | 88% |
   | ollama (llama3.2) | 58% | 44% | 80% | 74% | 40% |
   | vllm (Qwen 3B) | 68% | 60% | 50% | 52% | 90% |
   | vllm (Mistral 7B) | 38% | 68% | 68% | 50% | 82% |
   | systemds (Qwen 3B) | 72% | 66% | 62% | 52% | 88% |
   | systemds (Mistral 7B) | 38% | 74% | 70% | 52% | 82% |
   
   Accuracy is comparable between vLLM and SystemDS, as expected, since both serve the same model weights; the small per-workload differences are within sampling noise at n=50.
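   As a quick sanity check on the noise claim: for two independent proportions with $n = 50$ each, near $p \approx 0.6$,

   $$\mathrm{SE}_{\Delta} = \sqrt{\frac{p(1-p)}{n} + \frac{p(1-p)}{n}} \approx \sqrt{2 \cdot \frac{0.6 \cdot 0.4}{50}} \approx 0.098,$$

   so per-workload gaps below roughly 19 percentage points are indistinguishable at the 95% level; the largest vLLM-vs-SystemDS gap above is 12 points (summarization, Qwen 3B).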
   
   **Latency (p50, median response time):**
   
   | Backend | math | reasoning | summarization | json_extraction | embeddings |
   |---------|------|-----------|---------------|-----------------|------------|
   | vllm (Qwen 3B) | 4.7s | 2.5s | 742ms | 1.0s | 77ms |
   | systemds (Qwen 3B) | 22.2s | 7.0s | 2.1s | 3.1s | 144ms |
   | vllm (Mistral 7B) | 4.7s | 1.4s | 763ms | 1.8s | 135ms |
   | systemds (Mistral 7B) | 12.8s | 3.9s | 2.0s | 5.4s | 380ms |
   
   SystemDS is 2-5x slower than vLLM with the same model on the same GPU. The 
overhead comes from the Py4J bridge between Java and Python.
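   For reference, the general Py4J "Java calling Python" pattern looks like the sketch below; the `LLMWorker` interface and `generate` method are hypothetical stand-ins, not this PR's actual classes. Each call serializes its arguments over a local socket, blocks on the Python side, and deserializes the reply, which is where the fixed per-request cost accumulates:

   ```java
   import py4j.ClientServer;

   public class Py4jBridgeSketch {
       // Hypothetical Java-side view of the Python worker.
       public interface LLMWorker {
           String[] generate(String[] prompts); // one socket round-trip per call
       }

       public static void main(String[] args) {
           // Connects to a Python process running py4j's ClientServer with a
           // registered entry point that implements this interface.
           ClientServer bridge = new ClientServer(null);
           LLMWorker worker = (LLMWorker) bridge.getPythonServerEntryPoint(
               new Class[] { LLMWorker.class });

           // The bridge cost is paid per call, so batching prompts into a
           // single generate() amortizes it across the whole batch.
           String[] answers = worker.generate(new String[] { "What is 2+2?" });
           System.out.println(answers[0]);

           bridge.shutdown();
       }
   }
   ```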
   
   **Cost (total across all runs):**
   
   | Metric | OpenAI API | Local GPU (ollama + vllm + systemds) |
   |--------|------------|--------------------------------------|
   | Total cost | $0.0573 | $2.38 (electricity + HW amortization) |
   | Per query | $0.0002 | $0.001-$0.02, depending on workload |
   
   Local inference has a higher upfront cost but no per-query API charges. Cost tracking uses the `--power-draw-w` and `--hardware-cost` flags.
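   Presumably the two flags combine along these lines (the electricity rate $c_{\text{kWh}}$ and amortization window $T_{\text{amort}}$ below are assumed parameters, not confirmed flag names):

   $$\text{cost}(t) \approx \frac{P_{\text{draw}} \cdot t}{1000} \cdot c_{\text{kWh}} + \frac{C_{\text{HW}}}{T_{\text{amort}}} \cdot t,$$

   with $P_{\text{draw}}$ in watts from `--power-draw-w`, $C_{\text{HW}}$ from `--hardware-cost`, and $t$ the wall-clock run time in hours.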
   
   A self-contained HTML report (`benchmark_report.html`) with interactive 
tables and charts is included in the results.
   
   Made with [Cursor](https://cursor.com)

