Copilot commented on code in PR #344: URL: https://github.com/apache/incubator-hugegraph-computer/pull/344#discussion_r2740372831
########## vermeer/AGENTS.md: ########## @@ -0,0 +1,207 @@ +# AGENTS.md + +This file provides guidance to AI coding assistants when working with code in this repository. + +## Repository Overview + +Vermeer is a high-performance in-memory graph computing platform written in Go. It features a single-binary deployment model with master-worker architecture, supporting 20+ graph algorithms and seamless HugeGraph integration. + +## Build & Test Commands + +**Prerequisites:** +- Go 1.23+ +- `curl` and `unzip` (for downloading binary dependencies) + +**First-time setup:** +```bash +make init # Downloads supervisord and protoc binaries, installs Go deps +``` + +**Build:** +```bash +make # Build for current platform +make build-linux-amd64 +make build-linux-arm64 +``` + +**Development build with hot-reload UI:** +```bash +go build -tags=dev +``` + +**Clean:** +```bash +make clean # Remove built binaries and generated assets +make clean-all # Also remove downloaded tools +``` + +**Run:** +```bash +# Using binary directly +./vermeer --env=master +./vermeer --env=worker + +# Using script (configure in vermeer.sh) +./vermeer.sh start master +./vermeer.sh start worker +``` + +**Tests:** +```bash +# Run with build tag vermeer_test +go test -tags=vermeer_test -v + +# Specific test modes +go test -tags=vermeer_test -v -mode=algorithms +go test -tags=vermeer_test -v -mode=function +go test -tags=vermeer_test -v -mode=scheduler +``` + +**Regenerate protobuf (if proto files changed):** +```bash +go install google.golang.org/protobuf/cmd/[email protected] +go install google.golang.org/grpc/cmd/[email protected] +tools/protoc/osxm1/protoc *.proto --go-grpc_out=. --go_out=. +``` + +## Architecture + +### Directory Structure + +``` +vermeer/ +├── main.go # Single binary entry point +├── algorithms/ # Algorithm implementations +│ ├── algorithms.go # AlgorithmMaker registry +│ ├── pagerank.go +│ ├── louvain.go +│ └── ... +├── apps/ +│ ├── master/ # Master service +│ │ ├── services/ # HTTP handlers +│ │ ├── workers/ # Worker management (WorkerManager, WorkerClient) +│ │ ├── tasks/ # Task scheduling +│ │ └── graphs/ # Graph metadata management +│ ├── worker/ # Worker service entry +│ ├── compute/ # Worker-side compute logic +│ │ ├── api.go # Algorithm interface definition +│ │ ├── task.go # Compute task execution +│ │ └── ... +│ ├── graphio/ # Graph I/O (HugeGraph, CSV, HDFS) +│ │ └── hugegraph.go # HugeGraph integration +│ ├── protos/ # gRPC definitions +│ ├── common/ # Utilities, logging, metrics +│ ├── structure/ # Graph data structures +│ ├── storage/ # Persistence layer +│ └── bsp/ # BSP coordination helpers +├── config/ # Configuration templates +├── tools/ # Binary dependencies (supervisord, protoc) +└── ui/ # Web dashboard +``` + +### Key Design Patterns + +**1. Maker/Registry Pattern** + +Graph loaders and writers register themselves via `init()`: + +```go +func init() { + LoadMakers[LoadTypeHugegraph] = &HugegraphMaker{} +} +``` + +Master selects loader by type from the registry. Algorithms follow the same pattern in `algorithms/algorithms.go`. + +**2. Master-Worker Architecture** + +- **Master**: Schedules LoadPartition tasks to workers, manages worker lifecycle via WorkerManager/WorkerClient, exposes HTTP endpoints for graph/task management +- **Worker**: Executes compute tasks, reports status back to master via gRPC +- Communication: Master uses gRPC clients to workers (apps/master/workers/); workers connect to master on startup + +**3. 
HugeGraph Integration** + +Implementation in `apps/graphio/hugegraph.go`: + +1. **Metadata Query**: Queries HugeGraph PD (metadata service) via gRPC for partition information +2. **Data Loading**: Streams vertices/edges from HugeGraph Store via gRPC (`ScanPartition`) +3. **Result Writing**: Writes computed results back via HugeGraph HTTP REST API (adds vertex properties) + +The loader queries PD first (`QueryPartitions`), then creates LoadPartition tasks for each partition, which workers execute by calling `ScanPartition` on store nodes. + +**4. Algorithm Interface** + +Algorithms implement the interface defined in `apps/compute/api.go`. Each algorithm must register itself in `algorithms/algorithms.go` by appending to the `Algorithms` slice. + +**5. Single Binary Entry Point** + +`main.go` loads config from `config/{env}.ini`, then starts either master or worker based on `run_mode` parameter. The `--env` flag specifies which config file to use (e.g., `--env=master` loads `config/master.ini`). + +## Important Files + +- Entry point: `main.go` +- Algorithm interface: `apps/compute/api.go` +- Algorithm registry: `algorithms/algorithms.go` +- HugeGraph integration: `apps/graphio/hugegraph.go` +- Master scheduling: `apps/master/tasks/tasks.go` +- Worker management: `apps/master/workers/workers.go` +- HTTP endpoints: `apps/master/services/http_master.go` + +## Development Workflow + +**Adding a New Algorithm:** + +1. Create file in `algorithms/` implementing the interface from `apps/compute/api.go` +2. Register in `algorithms/algorithms.go` by appending to `Algorithms` slice +3. Implement required methods: `Init()`, `Compute()`, `Aggregate()`, `Terminate()` +4. Rebuild: `make` + +**Modifying Web UI:** + +1. Edit files in `ui/` +2. Regenerate assets: `cd asset && go generate` +3. Or use dev build: `go build -tags=dev` (hot-reload enabled) + +**Modifying Protobuf Definitions:** + +1. Edit `.proto` files in `apps/protos/` +2. Regenerate Go code using protoc (adjust path for platform): + ```bash + tools/protoc/osxm1/protoc apps/protos/*.proto --go-grpc_out=. --go_out=. + ``` + +## Configuration + +**Master (`config/master.ini`):** +- `http_peer`: Master HTTP listen address (default: 0.0.0.0:6688) +- `grpc_peer`: Master gRPC listen address (default: 0.0.0.0:6689) +- `run_mode`: Must be "master" +- `task_parallel_num`: Number of parallel tasks + +**Worker (`config/worker.ini`):** +- `http_peer`: Worker HTTP listen address (default: 0.0.0.0:6788) +- `grpc_peer`: Worker gRPC listen address (default: 0.0.0.0:6789) +- `master_peer`: Master gRPC address to connect (must match master's `grpc_peer`) +- `run_mode`: Must be "worker" Review Comment: The configuration documentation lists incorrect parameter names. The actual configuration files use `http_peer` and `grpc_peer`, not `listen_addr`. Please update lines 176-185 to match the actual configuration parameters in vermeer/config/master.ini and vermeer/config/worker.ini. The actual configs also don't have parameters like `task_parallel_num`, `compute_threads`, or `memory_limit` that are shown here. ########## vermeer/README.md: ########## @@ -1,125 +1,543 @@ -# Vermeer Graph Compute Engine +# Vermeer - High-Performance In-Memory Graph Computing + +Vermeer is a high-performance in-memory graph computing platform with a single-binary deployment model. It provides 20+ graph algorithms, custom algorithm extensions, and seamless integration with HugeGraph. 
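The single-binary entry point described in AGENTS.md above (config loaded from `config/{env}.ini`, role chosen by `run_mode`) can be pictured with a short Go sketch. The helpers `loadIni`, `startMaster`, and `startWorker` are hypothetical stand-ins, not functions from vermeer/main.go:

```go
// Conceptual sketch of Vermeer's --env startup flow: the flag names a config
// file under config/, and run_mode selects the master or worker role.
package main

import (
	"flag"
	"fmt"
	"log"
)

// loadIni, startMaster, and startWorker are hypothetical placeholders for the
// real config loading and service bootstrap in vermeer/main.go.
func loadIni(path string) (map[string]string, error) {
	return map[string]string{"run_mode": "master"}, nil // stub
}
func startMaster(cfg map[string]string) { fmt.Println("master up on", cfg["http_peer"]) }
func startWorker(cfg map[string]string) { fmt.Println("worker connecting to", cfg["master_peer"]) }

func main() {
	env := flag.String("env", "master", "config file name under config/ (without .ini)")
	flag.Parse()

	cfgPath := fmt.Sprintf("config/%s.ini", *env) // --env=master -> config/master.ini
	cfg, err := loadIni(cfgPath)
	if err != nil {
		log.Fatalf("load config %s: %v", cfgPath, err)
	}

	switch cfg["run_mode"] {
	case "master":
		startMaster(cfg)
	case "worker":
		startWorker(cfg)
	default:
		log.Fatalf("unknown run_mode %q in %s", cfg["run_mode"], cfgPath)
	}
}
```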
+ +## Key Features + +- **Single Binary Deployment**: Zero external dependencies, run anywhere +- **In-Memory Performance**: Optimized for fast iteration on medium to large graphs +- **Master-Worker Architecture**: Horizontal scalability by adding worker nodes +- **REST API + gRPC**: Easy integration with existing systems +- **Web UI Dashboard**: Built-in monitoring and job management +- **Multi-Source Support**: HugeGraph, local CSV, HDFS +- **20+ Graph Algorithms**: Production-ready implementations + +## Architecture + +```mermaid +graph TB + subgraph Client["Client Layer"] + API[REST API Client] + UI[Web UI Dashboard] + end + + subgraph Master["Master Node"] + HTTP[HTTP Server :6688] + GRPC_M[gRPC Server :6689] + GM[Graph Manager] + TM[Task Manager] + WM[Worker Manager] + SCH[Scheduler] + end + + subgraph Workers["Worker Nodes"] + W1[Worker 1 :6789] + W2[Worker 2 :6789] + W3[Worker N :6789] + end + + subgraph DataSources["Data Sources"] + HG[(HugeGraph)] + CSV[Local CSV] + HDFS[HDFS] + end + + API --> HTTP + UI --> HTTP + HTTP --> GM + HTTP --> TM + GRPC_M <--> W1 + GRPC_M <--> W2 + GRPC_M <--> W3 + + W1 <--> HG + W2 <--> HG + W3 <--> HG + W1 <--> CSV + W1 <--> HDFS + + style Master fill:#e1f5fe + style Workers fill:#fff3e0 + style DataSources fill:#f1f8e9 +``` + +### Directory Structure -## Introduction -Vermeer is a high-performance distributed graph computing platform based on memory, supporting more than 15 graph algorithms, custom algorithm extensions, and custom data source access. +``` +vermeer/ +├── main.go # Single binary entry point +├── Makefile # Build automation +├── algorithms/ # 20+ algorithm implementations +│ ├── pagerank.go +│ ├── louvain.go +│ ├── sssp.go +│ └── ... +├── apps/ +│ ├── master/ # Master service +│ │ ├── services/ # HTTP handlers +│ │ ├── workers/ # Worker management +│ │ └── tasks/ # Task scheduling +│ ├── compute/ # Worker-side compute logic +│ ├── graphio/ # Graph I/O (HugeGraph, CSV, HDFS) +│ │ └── hugegraph.go # HugeGraph integration +│ ├── protos/ # gRPC definitions +│ └── common/ # Utilities, logging, metrics +├── config/ # Configuration templates +│ ├── master.ini +│ └── worker.ini +├── tools/ # Binary dependencies (supervisord, protoc) +└── ui/ # Web dashboard +``` -## Run with Docker +## Quick Start + +### Option 1: Docker (Recommended) Pull the image: -``` + +```bash docker pull hugegraph/vermeer:latest ``` -Create local configuration files, for example, `~/master.ini` and `~/worker.ini`. +Create local configuration files `~/master.ini` and `~/worker.ini` (see [Configuration](#configuration) section). -Run with Docker. The `--env` flag specifies the file name. +Run with Docker: -``` -master: docker run -v ~/:/go/bin/config hugegraph/vermeer --env=master -worker: docker run -v ~/:/go/bin/config hugegraph/vermeer --env=worker +```bash +# Master node +docker run -v ~/:/go/bin/config hugegraph/vermeer --env=master + +# Worker node +docker run -v ~/:/go/bin/config hugegraph/vermeer --env=worker ``` -We've also provided a `docker-compose` file. 
Once you've created `~/master.ini` and `~/worker.ini`, and updated the `master_peer` in `worker.ini` to `172.20.0.10:6689`, you can run it using the following command: +#### Docker Compose -``` +Update `master_peer` in `~/worker.ini` to `172.20.0.10:6689`, then: + +```bash docker-compose up -d ``` -## Start +### Option 2: Binary Download +```bash +# Download binary (replace version and platform) +wget https://github.com/apache/hugegraph-computer/releases/download/vX.X.X/vermeer-linux-amd64.tar.gz +tar -xzf vermeer-linux-amd64.tar.gz +cd vermeer + +# Run master and worker +./vermeer --env=master & +./vermeer --env=worker & ``` -master: ./vermeer --env=master -worker: ./vermeer --env=worker01 -``` -The parameter env specifies the name of the configuration file in the useconfig folder. -``` +The `--env` parameter specifies the configuration file name in the `config/` folder (e.g., `master.ini`, `worker.ini`). + +#### Using the Shell Script + +Configure parameters in `vermeer.sh`, then: + +```bash ./vermeer.sh start master ./vermeer.sh start worker ``` -Configuration items are specified in vermeer.sh -## supervisord -Can be used with supervisord to start and stop services, automatically start applications, rotate logs, and more; for the configuration file, refer to config/supervisor.conf; -Configuration file reference config/supervisor.conf +### Option 3: Build from Source -```` -# run as daemon -./supervisord -c supervisor.conf -d -```` +#### Prerequisites -## Build from Source +- Go 1.23 or later +- `curl` and `unzip` utilities (for downloading dependencies) +- Internet connection (for first-time setup) -### Requirements -* Go 1.23 or later -* `curl` and `unzip` utilities (for downloading dependencies) -* Internet connection (for first-time setup) +#### Build Steps -### Quick Start - -**Recommended**: Use Makefile for building: +**Recommended**: Use Makefile: ```bash -# First time setup (downloads binary dependencies) +# First-time setup (downloads supervisord and protoc binaries) make init -# Build vermeer +# Build for current platform make + +# Or build for specific platform +make build-linux-amd64 +make build-linux-arm64 ``` -**Alternative**: Use the build script: +**Alternative**: Use build script: ```bash -# For AMD64 -./build.sh amd64 +# Auto-detect platform +./build.sh -# For ARM64 +# Or specify architecture +./build.sh amd64 ./build.sh arm64 ``` -# The script will: -# - Auto-detect your OS and architecture if no parameter is provided -# - Download required tools if not present -# - Generate assets and build the binary -# - Exit with error message if any step fails +#### Development Build -### Build Targets +For development with hot-reload of web UI: ```bash -make build # Build for current platform -make build-linux-amd64 # Build for Linux AMD64 -make build-linux-arm64 # Build for Linux ARM64 -make clean # Clean generated files -make help # Show all available targets +go build -tags=dev ``` -### Development Build +#### Clean Build Artifacts -For development with hot-reload of web UI: +```bash +make clean # Remove binaries and generated assets +make clean-all # Also remove downloaded tools (supervisord, protoc) +``` + +## Configuration + +### Master Configuration (`master.ini`) + +```ini +[master] +# Master server listen address +listen_addr = :6688 + +# Master gRPC server address +grpc_addr = :6689 + +# Worker heartbeat timeout (seconds) +worker_timeout = 30 + +# Task execution timeout (seconds) +task_timeout = 3600 + +[hugegraph] +# HugeGraph PD address for metadata +pd_peers = 
127.0.0.1:8686 + +# HugeGraph HTTP endpoint for result writing +server = http://127.0.0.1:8080 + +# Graph space name +graph = hugegraph +``` + +### Worker Configuration (`worker.ini`) + +```ini +[worker] +# Worker listen address +listen_addr = :6789 + +# Master gRPC address to connect +master_peer = 127.0.0.1:6689 + +# Worker ID (unique) +worker_id = worker01 + +# Number of compute threads +compute_threads = 4 + +# Memory limit (GB) +memory_limit = 8 + +[storage] +# Local disk path for spilling +data_path = ./data +``` + +## Available Algorithms + +| Algorithm | Category | Description | +|-----------|----------|-------------| +| **PageRank** | Centrality | Measures vertex importance via link structure | +| **Personalized PageRank** | Centrality | PageRank from specific source vertices | +| **Betweenness Centrality** | Centrality | Measures vertex importance via shortest paths | +| **Closeness Centrality** | Centrality | Measures average distance to all other vertices | +| **Degree Centrality** | Centrality | Simple in/out degree calculation | +| **Louvain** | Community Detection | Modularity-based community detection | +| **Louvain (Weighted)** | Community Detection | Weighted variant for edge-weighted graphs | +| **LPA** | Community Detection | Label Propagation Algorithm | +| **SLPA** | Community Detection | Speaker-Listener Label Propagation | +| **WCC** | Community Detection | Weakly Connected Components | +| **SCC** | Community Detection | Strongly Connected Components | +| **SSSP** | Path Finding | Single Source Shortest Path (Dijkstra) | +| **Triangle Count** | Graph Structure | Counts triangles in the graph | +| **K-Core** | Graph Structure | Finds k-core subgraphs | +| **K-Out** | Graph Structure | K-degree filtering | +| **Clustering Coefficient** | Graph Structure | Measures local clustering | +| **Cycle Detection** | Graph Structure | Detects cycles in directed graphs | +| **Jaccard Similarity** | Similarity | Computes neighbor-based similarity | +| **Depth (BFS)** | Traversal | Breadth-First Search depth assignment | + +## API Overview + +Vermeer exposes a REST API on port `6688` (configurable in `master.ini`). + +### Key Endpoints + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/api/v1/graphs` | POST | Load graph from data source | +| `/api/v1/graphs/{graph_id}` | GET | Get graph metadata | +| `/api/v1/graphs/{graph_id}` | DELETE | Unload graph from memory | +| `/api/v1/compute` | POST | Execute algorithm on loaded graph | +| `/api/v1/tasks/{task_id}` | GET | Get task status and results | +| `/api/v1/workers` | GET | List connected workers | +| `/ui/` | GET | Web UI dashboard | + +### Example: Run PageRank ```bash -go build -tags=dev +# 1. Load graph from HugeGraph +curl -X POST http://localhost:6688/api/v1/graphs \ + -H "Content-Type: application/json" \ + -d '{ + "graph_name": "my_graph", + "load_type": "hugegraph", + "hugegraph": { + "pd_peers": ["127.0.0.1:8686"], + "graph_name": "hugegraph" + } + }' + +# 2. Run PageRank +curl -X POST http://localhost:6688/api/v1/compute \ + -H "Content-Type: application/json" \ + -d '{ + "graph_name": "my_graph", + "algorithm": "pagerank", + "params": { + "max_iterations": 20, + "damping_factor": 0.85 + }, + "output": { + "type": "hugegraph", + "property_name": "pagerank_value" + } + }' + +# 3. 
Check task status +curl http://localhost:6688/api/v1/tasks/{task_id} +``` + +### OLAP vs OLTP Modes + +- **OLAP Mode**: Load entire graph into memory, run multiple algorithms +- **OLTP Mode**: Query-driven, load subgraphs on demand (planned feature) + +## Data Sources + +### HugeGraph Integration + +Vermeer integrates with HugeGraph via: + +1. **Metadata Query**: Queries HugeGraph PD (metadata service) via gRPC for partition information +2. **Data Loading**: Streams vertices/edges from HugeGraph Store via gRPC (`ScanPartition`) +3. **Result Writing**: Writes computed results back via HugeGraph REST API (adds vertex properties) + +Configuration in graph load request: + +```json +{ + "load_type": "hugegraph", + "hugegraph": { + "pd_peers": ["127.0.0.1:8686"], + "graph_name": "hugegraph", + "vertex_label": "person", + "edge_label": "knows" + } +} +``` + +### Local CSV Files + +Load graphs from local CSV files: + +```json +{ + "load_type": "csv", + "csv": { + "vertex_file": "/path/to/vertices.csv", + "edge_file": "/path/to/edges.csv", + "delimiter": "," + } +} +``` + +### HDFS + +Load from Hadoop Distributed File System: + +```json +{ + "load_type": "hdfs", + "hdfs": { + "namenode": "hdfs://namenode:9000", + "vertex_path": "/graph/vertices", + "edge_path": "/graph/edges" + } +} +``` + +## Developing Custom Algorithms + +Custom algorithms implement the `Algorithm` interface in `algorithms/algorithms.go`: + +```go +type Algorithm interface { + // Initialize the algorithm + Init(params map[string]interface{}) error + + // Compute one iteration for a vertex + Compute(vertex *Vertex, messages []Message) (halt bool, outMessages []Message) + + // Aggregate global state (optional) + Aggregate() interface{} + + // Check termination condition + Terminate(iteration int) bool +} +``` + +### Example: Simple Degree Count + +```go +package algorithms + +type DegreeCount struct { + maxIter int +} + +func (dc *DegreeCount) Init(params map[string]interface{}) error { + dc.maxIter = params["max_iterations"].(int) + return nil +} + +func (dc *DegreeCount) Compute(vertex *Vertex, messages []Message) (bool, []Message) { + // Store degree as vertex value + vertex.SetValue(float64(len(vertex.OutEdges))) + + // Halt after first iteration + return true, nil +} + +func (dc *DegreeCount) Terminate(iteration int) bool { + return iteration >= dc.maxIter +} +``` + +Register the algorithm in `algorithms/algorithms.go`: + +```go +func init() { + RegisterAlgorithm("degree_count", &DegreeCount{}) +} +``` Review Comment: The DegreeCount example algorithm won't compile because it doesn't implement the actual `WorkerComputer` interface required by Vermeer. The actual interface requires methods like `Init()`, `BeforeStep()`, `Compute(vertexId uint32, pId int)`, `AfterStep()`, etc. Additionally, algorithms need to be registered via a Maker pattern (e.g., `DegreeCountMaker`) that implements `AlgorithmMaker`. See vermeer/algorithms/degree.go for a working example of degree counting. ########## vermeer/README.md: ########## @@ -1,125 +1,543 @@ -# Vermeer Graph Compute Engine +# Vermeer - High-Performance In-Memory Graph Computing + +Vermeer is a high-performance in-memory graph computing platform with a single-binary deployment model. It provides 20+ graph algorithms, custom algorithm extensions, and seamless integration with HugeGraph. 
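Expanding on the DegreeCount review comment above: a minimal sketch of what the described `WorkerComputer` plus `AlgorithmMaker` shape could look like. The interfaces and registry below are local stand-ins modeled on the method names quoted in that comment, not the real definitions from `apps/compute/api.go`; see vermeer/algorithms/degree.go for the authoritative pattern:

```go
// Illustrative sketch of a worker-side computer with Init/BeforeStep/Compute/
// AfterStep hooks, registered through a Maker, as the review comment describes.
package main

import "fmt"

// Stand-in for the worker-side compute interface described in the review.
type WorkerComputer interface {
	Init() error
	BeforeStep()
	Compute(vertexID uint32, pID int)
	AfterStep()
}

// Stand-in for the AlgorithmMaker registration hook.
type AlgorithmMaker interface {
	Name() string
	CreateWorkerComputer() WorkerComputer
}

var makers = map[string]AlgorithmMaker{} // stand-in registry

func register(m AlgorithmMaker) { makers[m.Name()] = m }

// degreeComputer records a per-vertex degree value; outDegree is illustrative state.
type degreeComputer struct {
	outDegree []uint32
}

func (d *degreeComputer) Init() error { d.outDegree = make([]uint32, 1024); return nil }
func (d *degreeComputer) BeforeStep() {}
func (d *degreeComputer) Compute(vertexID uint32, pID int) {
	// A real implementation would read the vertex's adjacency list from the
	// loaded partition pID and store its length; a placeholder is used here.
	d.outDegree[vertexID] = 1
}
func (d *degreeComputer) AfterStep() {}

type DegreeCountMaker struct{}

func (DegreeCountMaker) Name() string                         { return "degree_count" }
func (DegreeCountMaker) CreateWorkerComputer() WorkerComputer { return &degreeComputer{} }

func main() {
	register(DegreeCountMaker{})
	c := makers["degree_count"].CreateWorkerComputer()
	_ = c.Init()
	c.BeforeStep()
	c.Compute(42, 0)
	c.AfterStep()
	fmt.Println("registered algorithms:", len(makers))
}
```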
Review Comment: The worker configuration example uses incorrect parameter names. The actual configuration file uses `http_peer`, `grpc_peer`, and doesn't have sections like `[worker]` or `[storage]`. Please update the configuration example to match the actual parameter names and structure used in vermeer/config/worker.ini. The actual config has a `[default]` section with `http_peer`, `grpc_peer`, `master_peer`, and `run_mode` parameters.

```suggestion
[default]
# Worker HTTP listen address
http_peer = 127.0.0.1:6789

# Worker gRPC listen address
grpc_peer = 127.0.0.1:6790

# Master gRPC address to connect
master_peer = 127.0.0.1:6689

# Run mode: worker or standalone
run_mode = worker
```

########## vermeer/README.md: ##########

Review Comment: The master configuration example includes a `[hugegraph]` section with parameters like `pd_peers`, `server`, and `graph`, but these don't exist in the actual vermeer/config/master.ini file. HugeGraph connection details are typically provided in the graph load request via the API, not in the master configuration file. This section should be removed or clarified that these are request parameters, not configuration file parameters.

########## vermeer/README.md: ##########

Review Comment: The Docker run examples mount the entire home directory into the container via `-v ~/:/go/bin/config`, which unnecessarily exposes all files under `~` (including SSH keys or cloud credentials) to any compromise of the Vermeer container. If an attacker gains code execution in this service (e.g., via its HTTP/gRPC API), they can read or modify sensitive data from the host’s home directory because of this broad volume mount.
To reduce the blast radius, map only the specific configuration path needed (or a dedicated, least-privileged directory) instead of the full home directory. ```suggestion # Master node (mount only the master config file, read-only) docker run -v ~/master.ini:/go/bin/config/master.ini:ro hugegraph/vermeer --env=master # Worker node (mount only the worker config file, read-only) docker run -v ~/worker.ini:/go/bin/config/worker.ini:ro hugegraph/vermeer --env=worker ``` ########## computer/README.md: ########## @@ -1,12 +1,519 @@ -# Apache HugeGraph-Computer +# Apache HugeGraph-Computer (Java) -The hugegraph-computer is a distributed graph processing system for hugegraph. It is an implementation of [Pregel](https://kowshik.github.io/JPregel/pregel_paper.pdf). It runs on Kubernetes or YARN framework. +HugeGraph-Computer is a distributed graph processing framework implementing the [Pregel](https://kowshik.github.io/JPregel/pregel_paper.pdf) model (BSP - Bulk Synchronous Parallel). It runs on Kubernetes or YARN clusters and integrates with HugeGraph for graph input/output. ## Features -- Support distributed MPP graph computing, and integrates with HugeGraph as graph input/output storage. -- Based on BSP (Bulk Synchronous Parallel) model, an algorithm performs computing through multiple parallel iterations, every iteration is a superstep. -- Auto memory management. The framework will never be OOM(Out of Memory) since it will split some data to disk if it doesn't have enough memory to hold all the data. -- The part of edges or the messages of supernode can be in memory, so you will never lose it. -- You can load the data from HDFS or HugeGraph, output the results to HDFS or HugeGraph, or adapt any other systems manually as needed. -- Easy to develop a new algorithm. You need to focus on a vertex only processing just like as in a single server, without worrying about message transfer and memory/storage management. +- **Distributed MPP Computing**: Massively parallel graph processing across cluster nodes +- **BSP Model**: Algorithm execution through iterative supersteps with global synchronization +- **Auto Memory Management**: Automatic spill to disk when memory is insufficient - never OOM +- **Flexible Data Sources**: Load from HugeGraph or HDFS, output to HugeGraph or HDFS +- **Easy Algorithm Development**: Focus on single-vertex logic without worrying about distribution +- **Production-Ready**: Battle-tested on billion-scale graphs with Kubernetes integration + +## Architecture + +### Module Structure + +``` +┌─────────────────────────────────────────────────────────────┐ +│ HugeGraph-Computer │ +├─────────────────────────────────────────────────────────────┤ +│ ┌─────────────────────────────────────────────────────┐ │ +│ │ computer-driver │ │ +│ │ (Job Submission & Coordination) │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┼───────────────────────────┐ │ +│ │ Deployment Layer (choose one) │ │ +│ │ ┌──────────────┐ ┌──────────────┐ │ │ +│ │ │ computer-k8s │ │ computer-yarn│ │ │ +│ │ └──────────────┘ └──────────────┘ │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┼───────────────────────────┐ │ +│ │ computer-core │ │ +│ │ (WorkerService, MasterService, BSP) │ │ +│ │ ┌──────────────────────────────────────────────┐ │ │ +│ │ │ Managers: Message, Aggregation, Snapshot... 
│ │ │ +│ │ └──────────────────────────────────────────────┘ │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┼───────────────────────────┐ │ +│ │ computer-algorithm │ │ +│ │ (PageRank, LPA, WCC, SSSP, TriangleCount...) │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┴───────────────────────────┐ │ +│ │ computer-api │ │ +│ │ (Computation, Vertex, Edge, Aggregator, Value) │ │ +│ └─────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Module Descriptions + +| Module | Description | +|--------|-------------| +| **computer-api** | Public interfaces for algorithm development (`Computation`, `Vertex`, `Edge`, `Aggregator`, `Combiner`) | +| **computer-core** | Runtime implementation (WorkerService, MasterService, messaging, BSP coordination, memory management) | +| **computer-algorithm** | Built-in graph algorithms (45+ implementations) | +| **computer-driver** | Job submission and driver-side coordination | +| **computer-k8s** | Kubernetes deployment integration | +| **computer-yarn** | YARN deployment integration | +| **computer-k8s-operator** | Kubernetes operator for job lifecycle management | +| **computer-dist** | Distribution packaging and assembly | +| **computer-test** | Integration tests and unit tests | + +## Prerequisites + +- **JDK 11** or later (for building and running) +- **Maven 3.5+** for building +- **Kubernetes cluster** or **YARN cluster** for deployment +- **etcd** for BSP coordination (configured via `BSP_ETCD_URL`) + +**Note**: For K8s-operator module development, run `mvn clean install` in `computer-k8s-operator` first to generate CRD classes. + +## Quick Start + +### Build from Source + +```bash +cd computer + +# Compile (skip javadoc for faster builds) +mvn clean compile -Dmaven.javadoc.skip=true + +# Package (skip tests for faster packaging) +mvn clean package -DskipTests +``` + +### Run Tests + +```bash +# Unit tests +mvn test -P unit-test + +# Integration tests (requires etcd, K8s, HugeGraph) +mvn test -P integrate-test + +# Run specific test class +mvn test -P unit-test -Dtest=ClassName + +# Run specific test method +mvn test -P unit-test -Dtest=ClassName#methodName +``` + +### License Check + +```bash +mvn apache-rat:check +``` + +### Deploy on Kubernetes + +#### 1. Configure Job + +Create `job-config.properties`: + +```properties +# Algorithm class +algorithm.class=org.apache.hugegraph.computer.algorithm.centrality.pagerank.PageRank + +# HugeGraph connection +hugegraph.url=http://hugegraph-server:8080 +hugegraph.graph=hugegraph + +# K8s configuration +k8s.namespace=default +k8s.image=hugegraph/hugegraph-computer:latest +k8s.master.cpu=2 +k8s.master.memory=4Gi +k8s.worker.replicas=3 +k8s.worker.cpu=4 +k8s.worker.memory=8Gi + +# BSP coordination (etcd) +bsp.etcd.url=http://etcd-cluster:2379 + +# Algorithm parameters (PageRank example) +pagerank.damping_factor=0.85 +pagerank.max_iterations=20 +pagerank.convergence_tolerance=0.0001 Review Comment: The PageRank configuration example shows parameters `pagerank.damping_factor`, `pagerank.max_iterations`, and `pagerank.convergence_tolerance` that don't match the actual PageRank implementation. The actual parameters are `page_rank.alpha` (not damping_factor) and `pagerank.l1DiffThreshold` (not convergence_tolerance). The max_iterations is controlled by the BSP framework's `bsp.max_superstep` parameter, not a PageRank-specific parameter. 
Please update the example to use the actual parameter names from computer/computer-algorithm/src/main/java/org/apache/hugegraph/computer/algorithm/centrality/pagerank/PageRank.java and PageRank4Master.java.

```suggestion
page_rank.alpha=0.85
bsp.max_superstep=20
pagerank.l1DiffThreshold=0.0001
```

########## vermeer/README.md: ##########

Review Comment: The configuration examples in the README use incorrect parameter names. The actual configuration file uses `http_peer` and `grpc_peer`, but this documentation shows `listen_addr` and `grpc_addr`. Please update the configuration examples to match the actual parameter names used in vermeer/config/master.ini (line 19-20).

```suggestion
http_peer = :6688

# Master gRPC server address
grpc_peer = :6689
```

########## vermeer/README.md: ##########

Review Comment: The simplified "Algorithm" interface shown here doesn't match the actual implementation. Vermeer algorithms actually implement the `WorkerComputer` and `MasterComputer` interfaces defined in `apps/compute/api.go`, not this simplified interface. The actual interfaces have methods like `Init()`, `BeforeStep()`, `Compute(vertexId, pId)`, `AfterStep()`, etc. This simplified example may confuse developers trying to actually implement custom algorithms. Consider either showing the actual interface or clearly marking this as a conceptual/simplified example.

```suggestion
In the actual Vermeer implementation, custom algorithms are plugged in by implementing the `WorkerComputer` and (optionally) `MasterComputer` interfaces defined in `apps/compute/api.go`. These real interfaces include lifecycle hooks such as `Init()`, `BeforeStep()`, `Compute(vertexId, pId)`, and `AfterStep()`, along with other methods needed by the runtime. To illustrate the high-level algorithm lifecycle conceptually, the following **simplified** `Algorithm` interface is shown.
```
This interface does **not** exist verbatim in the codebase; it is a conceptual model only: ```go // Conceptual, simplified view of an algorithm's lifecycle in Vermeer. // For real implementations, see WorkerComputer / MasterComputer in apps/compute/api.go. type Algorithm interface { // Initialize the algorithm with user-provided parameters. Init(params map[string]interface{}) error // Compute one iteration for a vertex, given incoming messages. Compute(vertex *Vertex, messages []Message) (halt bool, outMessages []Message) // Aggregate global state across vertices (optional). Aggregate() interface{} // Check whether the algorithm should terminate after the given iteration. ``` ########## computer/README.md: ########## @@ -1,12 +1,519 @@ -# Apache HugeGraph-Computer +# Apache HugeGraph-Computer (Java) -The hugegraph-computer is a distributed graph processing system for hugegraph. It is an implementation of [Pregel](https://kowshik.github.io/JPregel/pregel_paper.pdf). It runs on Kubernetes or YARN framework. +HugeGraph-Computer is a distributed graph processing framework implementing the [Pregel](https://kowshik.github.io/JPregel/pregel_paper.pdf) model (BSP - Bulk Synchronous Parallel). It runs on Kubernetes or YARN clusters and integrates with HugeGraph for graph input/output. ## Features -- Support distributed MPP graph computing, and integrates with HugeGraph as graph input/output storage. -- Based on BSP (Bulk Synchronous Parallel) model, an algorithm performs computing through multiple parallel iterations, every iteration is a superstep. -- Auto memory management. The framework will never be OOM(Out of Memory) since it will split some data to disk if it doesn't have enough memory to hold all the data. -- The part of edges or the messages of supernode can be in memory, so you will never lose it. -- You can load the data from HDFS or HugeGraph, output the results to HDFS or HugeGraph, or adapt any other systems manually as needed. -- Easy to develop a new algorithm. You need to focus on a vertex only processing just like as in a single server, without worrying about message transfer and memory/storage management. 
+- **Distributed MPP Computing**: Massively parallel graph processing across cluster nodes +- **BSP Model**: Algorithm execution through iterative supersteps with global synchronization +- **Auto Memory Management**: Automatic spill to disk when memory is insufficient - never OOM +- **Flexible Data Sources**: Load from HugeGraph or HDFS, output to HugeGraph or HDFS +- **Easy Algorithm Development**: Focus on single-vertex logic without worrying about distribution +- **Production-Ready**: Battle-tested on billion-scale graphs with Kubernetes integration + +## Architecture + +### Module Structure + +``` +┌─────────────────────────────────────────────────────────────┐ +│ HugeGraph-Computer │ +├─────────────────────────────────────────────────────────────┤ +│ ┌─────────────────────────────────────────────────────┐ │ +│ │ computer-driver │ │ +│ │ (Job Submission & Coordination) │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┼───────────────────────────┐ │ +│ │ Deployment Layer (choose one) │ │ +│ │ ┌──────────────┐ ┌──────────────┐ │ │ +│ │ │ computer-k8s │ │ computer-yarn│ │ │ +│ │ └──────────────┘ └──────────────┘ │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┼───────────────────────────┐ │ +│ │ computer-core │ │ +│ │ (WorkerService, MasterService, BSP) │ │ +│ │ ┌──────────────────────────────────────────────┐ │ │ +│ │ │ Managers: Message, Aggregation, Snapshot... │ │ │ +│ │ └──────────────────────────────────────────────┘ │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┼───────────────────────────┐ │ +│ │ computer-algorithm │ │ +│ │ (PageRank, LPA, WCC, SSSP, TriangleCount...) │ │ +│ └─────────────────────────┬───────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┴───────────────────────────┐ │ +│ │ computer-api │ │ +│ │ (Computation, Vertex, Edge, Aggregator, Value) │ │ +│ └─────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Module Descriptions + +| Module | Description | +|--------|-------------| +| **computer-api** | Public interfaces for algorithm development (`Computation`, `Vertex`, `Edge`, `Aggregator`, `Combiner`) | +| **computer-core** | Runtime implementation (WorkerService, MasterService, messaging, BSP coordination, memory management) | +| **computer-algorithm** | Built-in graph algorithms (45+ implementations) | +| **computer-driver** | Job submission and driver-side coordination | +| **computer-k8s** | Kubernetes deployment integration | +| **computer-yarn** | YARN deployment integration | +| **computer-k8s-operator** | Kubernetes operator for job lifecycle management | +| **computer-dist** | Distribution packaging and assembly | +| **computer-test** | Integration tests and unit tests | + +## Prerequisites + +- **JDK 11** or later (for building and running) +- **Maven 3.5+** for building +- **Kubernetes cluster** or **YARN cluster** for deployment +- **etcd** for BSP coordination (configured via `BSP_ETCD_URL`) + +**Note**: For K8s-operator module development, run `mvn clean install` in `computer-k8s-operator` first to generate CRD classes. 
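Since BSP coordination depends on a reachable etcd cluster, a quick connectivity probe before job submission can save a failed launch. The following standalone JDK 11+ sketch is illustrative only (it is not part of the codebase; the only assumptions are the `BSP_ETCD_URL` environment variable named above and etcd's standard HTTP `/health` endpoint):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Standalone pre-flight check: verify that the etcd endpoint used for
// BSP coordination answers on /health before submitting a job.
public class EtcdHealthCheck {
    public static void main(String[] args) throws Exception {
        // Same endpoint the job config passes via bsp.etcd.url / BSP_ETCD_URL.
        String etcd = System.getenv().getOrDefault("BSP_ETCD_URL", "http://127.0.0.1:2379");
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(etcd + "/health"))
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // etcd returns {"health":"true"} when the member is healthy.
        System.out.println("etcd " + etcd + " -> " + response.statusCode() + " " + response.body());
    }
}
```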
+ +## Quick Start + +### Build from Source + +```bash +cd computer + +# Compile (skip javadoc for faster builds) +mvn clean compile -Dmaven.javadoc.skip=true + +# Package (skip tests for faster packaging) +mvn clean package -DskipTests +``` + +### Run Tests + +```bash +# Unit tests +mvn test -P unit-test + +# Integration tests (requires etcd, K8s, HugeGraph) +mvn test -P integrate-test + +# Run specific test class +mvn test -P unit-test -Dtest=ClassName + +# Run specific test method +mvn test -P unit-test -Dtest=ClassName#methodName +``` + +### License Check + +```bash +mvn apache-rat:check +``` + +### Deploy on Kubernetes + +#### 1. Configure Job + +Create `job-config.properties`: + +```properties +# Algorithm class +algorithm.class=org.apache.hugegraph.computer.algorithm.centrality.pagerank.PageRank + +# HugeGraph connection +hugegraph.url=http://hugegraph-server:8080 +hugegraph.graph=hugegraph + +# K8s configuration +k8s.namespace=default +k8s.image=hugegraph/hugegraph-computer:latest +k8s.master.cpu=2 +k8s.master.memory=4Gi +k8s.worker.replicas=3 +k8s.worker.cpu=4 +k8s.worker.memory=8Gi + +# BSP coordination (etcd) +bsp.etcd.url=http://etcd-cluster:2379 + +# Algorithm parameters (PageRank example) +pagerank.damping_factor=0.85 +pagerank.max_iterations=20 +pagerank.convergence_tolerance=0.0001 +``` + +#### 2. Submit Job + +```bash +java -jar computer-driver/target/computer-driver-${VERSION}.jar \ + --config job-config.properties +``` + +#### 3. Monitor Job + +```bash +# Check pod status +kubectl get pods -n default + +# View master logs +kubectl logs hugegraph-computer-master-xxx -n default + +# View worker logs +kubectl logs hugegraph-computer-worker-0 -n default +``` + +### Deploy on YARN + +**Note**: YARN deployment support is under development. Use Kubernetes for production deployments. 
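A typo in `job-config.properties` usually surfaces only after pods have been scheduled, so a driver-side pre-flight check is cheap insurance. The sketch below is a hypothetical helper, not something shipped with `computer-driver`; the required key names simply mirror the sample configuration shown above:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight validation for job-config.properties.
// Required keys mirror the sample K8s job configuration above.
public class JobConfigCheck {
    private static final List<String> REQUIRED_KEYS = List.of(
            "algorithm.class",
            "hugegraph.url",
            "hugegraph.graph",
            "k8s.namespace",
            "k8s.image",
            "bsp.etcd.url");

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(args.length > 0 ? args[0] : "job-config.properties")) {
            props.load(in);
        }
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.isBlank()) {
                throw new IllegalStateException("Missing required job config key: " + key);
            }
        }
        System.out.println("job-config.properties looks complete: " + props.size() + " entries");
    }
}
```

Running this before `java -jar computer-driver-...` fails fast on a missing key instead of waiting for pods to crash-loop.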
+ +## Available Algorithms + +### Centrality Algorithms + +| Algorithm | Class | Description | +|-----------|-------|-------------| +| PageRank | `algorithm.centrality.pagerank.PageRank` | Standard PageRank | +| Personalized PageRank | `algorithm.centrality.pagerank.PersonalizedPageRank` | Source-specific PageRank | +| Betweenness Centrality | `algorithm.centrality.betweenness.BetweennessCentrality` | Shortest-path-based centrality | +| Closeness Centrality | `algorithm.centrality.closeness.ClosenessCentrality` | Average distance centrality | +| Degree Centrality | `algorithm.centrality.degree.DegreeCentrality` | In/out degree counting | + +### Community Detection + +| Algorithm | Class | Description | +|-----------|-------|-------------| +| LPA | `algorithm.community.lpa.Lpa` | Label Propagation Algorithm | +| WCC | `algorithm.community.wcc.Wcc` | Weakly Connected Components | +| Louvain | `algorithm.community.louvain.Louvain` | Modularity-based community detection | +| K-Core | `algorithm.community.kcore.KCore` | K-core decomposition | + +### Path Finding + +| Algorithm | Class | Description | +|-----------|-------|-------------| +| SSSP | `algorithm.path.sssp.Sssp` | Single Source Shortest Path | +| BFS | `algorithm.traversal.bfs.Bfs` | Breadth-First Search | +| Rings | `algorithm.path.rings.Rings` | Cycle/ring detection | + +### Graph Structure + +| Algorithm | Class | Description | +|-----------|-------|-------------| +| Triangle Count | `algorithm.trianglecount.TriangleCount` | Count triangles | +| Clustering Coefficient | `algorithm.clusteringcoefficient.ClusteringCoefficient` | Local clustering measure | + +**Full algorithm list**: See `computer-algorithm/src/main/java/org/apache/hugegraph/computer/algorithm/` + +## Developing Custom Algorithms + +### Algorithm Contract + +Algorithms implement the `Computation` interface from `computer-api`: + +```java +package org.apache.hugegraph.computer.core.worker; + +public interface Computation<M extends Value> { + /** + * Initialization at superstep 0 + */ + void compute0(ComputationContext context, Vertex vertex); + + /** + * Message processing in subsequent supersteps + */ + void compute(ComputationContext context, Vertex vertex, Iterator<M> messages); +} +``` + +### Example: Simple PageRank + +```java +package org.apache.hugegraph.computer.algorithm.centrality.pagerank; + +import org.apache.hugegraph.computer.core.worker.Computation; +import org.apache.hugegraph.computer.core.worker.ComputationContext; + +public class PageRank implements Computation<DoubleValue> { + + public static final String OPTION_ALPHA = "pagerank.alpha"; + public static final String OPTION_MAX_ITERATIONS = "pagerank.max_iterations"; + + private double alpha; + private int maxIterations; + + @Override + public void init(Config config) { + this.alpha = config.getDouble(OPTION_ALPHA, 0.85); + this.maxIterations = config.getInt(OPTION_MAX_ITERATIONS, 20); + } + + @Override + public void compute0(ComputationContext context, Vertex vertex) { + // Initialize: set initial PR value + vertex.value(new DoubleValue(1.0)); + + // Send PR to neighbors + int edgeCount = vertex.numEdges(); + if (edgeCount > 0) { + double contribution = 1.0 / edgeCount; + context.sendMessageToAllEdges(vertex, new DoubleValue(contribution)); + } + } + + @Override + public void compute(ComputationContext context, Vertex vertex, Iterator<DoubleValue> messages) { + // Sum incoming PR contributions + double sum = 0.0; + while (messages.hasNext()) { + sum += messages.next().value(); + } + + // Calculate 
new PR value + double newPR = (1.0 - alpha) + alpha * sum; + vertex.value(new DoubleValue(newPR)); + + // Send to neighbors if not converged + if (context.superstep() < maxIterations) { + int edgeCount = vertex.numEdges(); + if (edgeCount > 0) { + double contribution = newPR / edgeCount; + context.sendMessageToAllEdges(vertex, new DoubleValue(contribution)); + } + } else { + vertex.inactivate(); + } + } +} Review Comment: The simplified PageRank example doesn't import DoubleValue or show the required interface methods (name(), category(), init(), etc.) that are part of the actual Computation interface. While this is a simplified teaching example, it may confuse developers. Consider adding a note that this is a conceptual example and doesn't show all required methods, or reference the actual implementation at computer/computer-algorithm/src/main/java/org/apache/hugegraph/computer/algorithm/centrality/pagerank/PageRank.java.
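For readers who want to experiment with the superstep contract without the framework, a throwaway harness can make the `compute0`/`compute` split concrete. The sketch below is purely illustrative (plain Java over a hard-coded three-vertex graph; none of these classes exist in the repository) and applies the same `(1 - alpha) + alpha * sum` update as the example above:

```java
import java.util.Arrays;

// Toy, framework-free Pregel-style harness: NOT HugeGraph-Computer code.
// Real algorithms implement Computation<M> and run inside computer-core.
public class ToyPageRankHarness {
    static final double ALPHA = 0.85;
    static final int MAX_ITERATIONS = 20;

    public static void main(String[] args) {
        // Tiny directed graph as adjacency lists: 0->{1,2}, 1->{2}, 2->{0}.
        int[][] out = {{1, 2}, {2}, {0}};
        int n = out.length;
        double[] rank = new double[n];
        double[] incoming = new double[n];

        // "compute0": initialize every rank and send first contributions.
        Arrays.fill(rank, 1.0);
        for (int v = 0; v < n; v++) {
            for (int dst : out[v]) {
                incoming[dst] += rank[v] / out[v].length;
            }
        }

        // "compute" for supersteps 1..MAX_ITERATIONS: absorb incoming
        // messages, apply the damping formula, send new contributions.
        for (int step = 1; step <= MAX_ITERATIONS; step++) {
            double[] next = new double[n];
            for (int v = 0; v < n; v++) {
                rank[v] = (1.0 - ALPHA) + ALPHA * incoming[v];
            }
            for (int v = 0; v < n; v++) {
                for (int dst : out[v]) {
                    next[dst] += rank[v] / out[v].length;
                }
            }
            incoming = next;
        }

        for (int v = 0; v < n; v++) {
            System.out.printf("vertex %d: rank %.4f%n", v, rank[v]);
        }
    }
}
```

Superstep 0 plays the role of `compute0` (seed values and send first messages), and each loop iteration plays the role of `compute` (absorb messages, update, re-send), which is the same lifecycle the real `Computation` implementations follow.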
########## vermeer/README.md: ########## +## Developing Custom Algorithms + +Custom algorithms implement the `Algorithm` interface in `algorithms/algorithms.go`: + +```go +type Algorithm interface { + // Initialize the algorithm + Init(params map[string]interface{}) error + + // Compute one iteration for a vertex + Compute(vertex *Vertex, messages []Message) (halt bool, outMessages []Message) + + // Aggregate global state (optional) + Aggregate() interface{} + + // Check termination condition + Terminate(iteration int) bool +} +``` + +### Example: Simple Degree Count + +```go +package algorithms + +type DegreeCount struct { + maxIter int +} + +func (dc *DegreeCount) Init(params map[string]interface{}) error { + dc.maxIter = params["max_iterations"].(int) + return nil +} + +func (dc *DegreeCount) Compute(vertex *Vertex, messages []Message) (bool, []Message) { + // Store degree as vertex value + vertex.SetValue(float64(len(vertex.OutEdges))) + + // Halt after first iteration + return true, nil +} + +func (dc *DegreeCount) Terminate(iteration int) bool { + return iteration >= dc.maxIter +} +``` + +Register the algorithm in `algorithms/algorithms.go`: + +```go +func init() { + RegisterAlgorithm("degree_count", &DegreeCount{}) +} +``` + +## Memory Management + +Vermeer uses an in-memory-first approach: + +1. **Graph Loading**: Vertices and edges are distributed across workers and stored in memory +2. **Automatic Partitioning**: Master assigns partitions to workers based on capacity +3. **Memory Monitoring**: Workers report memory usage to master +4. **Graceful Degradation**: If memory is insufficient, algorithms may fail (disk spilling not yet implemented) + +**Best Practice**: Ensure total worker memory exceeds graph size by 2-3x for algorithm workspace. + +## Supervisord Integration + +Run Vermeer as a daemon with automatic restarts and log rotation: + +```bash +# Configuration in config/supervisor.conf +./tools/supervisord -c config/supervisor.conf -d ``` ---- +Sample supervisor configuration: -### Protobuf Development +```ini +[program:vermeer-master] +command=/path/to/vermeer --env=master +autostart=true +autorestart=true +stdout_logfile=/var/log/vermeer-master.log +``` -If you need to regenerate protobuf files: +## Protobuf Development + +If you modify `.proto` files, regenerate Go code: ```bash # Install protobuf Go plugins go install google.golang.org/protobuf/cmd/[email protected] go install google.golang.org/grpc/cmd/[email protected] -# Generate protobuf files +# Generate (adjust protoc path for your platform) tools/protoc/osxm1/protoc *.proto --go-grpc_out=. --go_out=.
``` --- +## Performance Tuning + +### Worker Configuration + +- **compute_threads**: Set to number of CPU cores for CPU-bound algorithms +- **memory_limit**: Set to 70-80% of available RAM +- **partition_count**: Increase for better parallelism (default: auto-calculated) + +### Master Configuration + +- **worker_timeout**: Increase for slow networks or heavily loaded workers +- **task_timeout**: Increase for long-running algorithms (e.g., Louvain on large graphs) Review Comment: The Performance Tuning section references configuration parameters like `compute_threads`, `memory_limit`, `partition_count`, `worker_timeout`, and `task_timeout` that don't exist in the actual vermeer configuration files. The actual configuration files (vermeer/config/master.ini and worker.ini) use different parameter names. Please verify these parameters exist or update the documentation to match the actual configuration schema.
