kubraaksux opened a new pull request, #2431: URL: https://github.com/apache/systemds/pull/2431
This PR adds an LLM benchmarking framework and a SystemDS JMLC backend for evaluating inference across OpenAI, Ollama, vLLM, and SystemDS on 5 workloads (math, reasoning, summarization, JSON extraction, embeddings).

It extends the JMLC API with LLM inference support via a Py4J bridge: `Connection.java` manages the Python worker lifecycle, `PreparedScript.java` handles batch inference through `FrameBlock`, and the benchmark framework runs standardized workloads against all backends (a usage sketch is included at the end of this description).

### Experimental results

30 runs on an NVIDIA H100, 50 samples each, across 4 backends and 2 local models (Qwen 3B, Mistral 7B).

**Accuracy (% correct):**

| Backend | math | reasoning | summarization | json_extraction | embeddings |
|---------|------|-----------|---------------|-----------------|------------|
| openai (gpt-4.1-mini) | 88% | 70% | 88% | 84% | 88% |
| ollama (llama3.2) | 58% | 44% | 80% | 74% | 40% |
| vllm (Qwen 3B) | 68% | 60% | 50% | 52% | 90% |
| vllm (Mistral 7B) | 38% | 68% | 68% | 50% | 82% |
| systemds (Qwen 3B) | 72% | 66% | 62% | 52% | 88% |
| systemds (Mistral 7B) | 38% | 74% | 70% | 52% | 82% |

Accuracy between vLLM and SystemDS is comparable since they run the same models; the small differences are within statistical noise (n=50).

**Latency (p50, median response time):**

| Backend | math | reasoning | summarization | json_extraction | embeddings |
|---------|------|-----------|---------------|-----------------|------------|
| vllm (Qwen 3B) | 4.7s | 2.5s | 742ms | 1.0s | 77ms |
| systemds (Qwen 3B) | 22.2s | 7.0s | 2.1s | 3.1s | 144ms |
| vllm (Mistral 7B) | 4.7s | 1.4s | 763ms | 1.8s | 135ms |
| systemds (Mistral 7B) | 12.8s | 3.9s | 2.0s | 5.4s | 380ms |

SystemDS is 2-5x slower than vLLM with the same model on the same GPU; the overhead comes from the Py4J bridge between Java and Python.

**Cost (total across all runs):**

| | OpenAI API | Local GPU (ollama + vllm + systemds) |
|--|-----------|--------------------------------------|
| Total cost | $0.0573 | $2.38 (electricity + HW amortization) |
| Per query | $0.0002 | $0.001-0.02 depending on workload |

Local inference has a higher upfront cost but no per-query API charges. Cost tracking uses the `--power-draw-w` and `--hardware-cost` flags (an illustrative cost model is sketched below).

A self-contained HTML report (`benchmark_report.html`) with interactive tables and charts is included in the results.

Made with [Cursor](https://cursor.com)
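For reviewers, here is a minimal sketch of how batch inference flows through the JMLC API from the caller's side. It uses the standard `Connection`/`PreparedScript`/`ResultVariables` classes from `org.apache.sysds.api.jmlc`; the DML body is only a placeholder for the LLM call that this PR wires in through the Py4J worker, and the single-column prompt layout is an assumption for illustration, not the exact interface added here:

```java
import org.apache.sysds.api.jmlc.Connection;
import org.apache.sysds.api.jmlc.PreparedScript;
import org.apache.sysds.api.jmlc.ResultVariables;

public class LlmJmlcSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder DML: in the real backend the identity assignment would be
    // replaced by the LLM inference operation exposed via the Py4J bridge.
    String dml =
        "prompts = read(\"./tmp\", data_type=\"frame\", format=\"csv\");\n"
      + "responses = prompts;\n"
      + "write(responses, \"./tmp_out\", format=\"csv\");";

    try (Connection conn = new Connection()) {
      PreparedScript ps = conn.prepareScript(
          dml, new String[]{"prompts"}, new String[]{"responses"});

      // One prompt per row; the single-column layout is an assumption.
      String[][] batch = {
          {"What is 17 * 23?"},
          {"Summarize: SystemDS is a machine learning system ..."}
      };
      ps.setFrame("prompts", batch);

      // Run the script and read the response frame back out.
      ResultVariables res = ps.executeScript();
      String[][] responses = res.getFrame("responses");
      for (String[] row : responses)
        System.out.println(row[0]);
    }
  }
}
```

In the actual backend, `Connection` starts and stops the Python worker and `PreparedScript` forwards the prompt frame over the Py4J bridge; the sketch only shows the calling convention on the Java side.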

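On the cost side, a rough per-query model consistent with the two flags is shown below; the electricity price and amortization period are illustrative assumptions, not values hard-coded in the framework:

$$
\mathrm{cost}_{\mathrm{query}} \approx \frac{P}{1000}\, t\, c_{\mathrm{kWh}} \;+\; \frac{C_{\mathrm{hw}}}{T_{\mathrm{life}}}\, t
$$

where $P$ is the GPU power draw in watts (`--power-draw-w`), $t$ the query runtime in hours, $c_{\mathrm{kWh}}$ the electricity price, $C_{\mathrm{hw}}$ the hardware cost (`--hardware-cost`), and $T_{\mathrm{life}}$ the assumed amortization period in hours. For example, at an assumed 700 W and $0.30/kWh, a 5 s query costs roughly $0.0003 in electricity before amortization.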