kitalkuyo-gita opened a new issue, #776:
URL: https://github.com/apache/geaflow/issues/776
## 1. Background and Motivation
### 1.1 Problem Statement
GeaFlow's existing inference subsystem (`geaflow-infer`) was built
exclusively around the PyTorch framework. The entire execution path — from
environment installation (`install-infer-env.sh`) to the Python inference
session (`inferSession.py`) — hardcodes PyTorch assumptions. As a result, users
who wish to deploy models from the PaddlePaddle ecosystem are completely
blocked:
- `install-infer-env.sh` only installs PyTorch and its dependencies.
- `infer_server.py` instantiates `TorchInferSession` unconditionally.
- There is no abstraction layer separating "session protocol" from
"framework implementation".
- Graph algorithms in GeaFlow DSL have no access to spatial GNN models such
as those provided by **PaddleSpatial**.
Meanwhile, PaddlePaddle is widely adopted in China's industrial AI
ecosystem, and **PaddleSpatial** provides production-grade spatial graph neural
network algorithms — notably the **Spatial Adaptive GNN (SA-GNN)** — that are
directly applicable to location-aware graph analytics (ride-hailing networks,
urban road graphs, spatial knowledge graphs, etc.).
### 1.2 Motivation: Why SA-GNN?
SA-GNN (Spatial Adaptive Graph Neural Network) improves upon
direction-agnostic GNNs such as GraphSAGE by:
1. **Partitioning neighbours into directional spatial sectors** (e.g., 8
compass directions) based on coordinate angles, then aggregating each sector
independently (`SpatialOrientedAGG`).
2. **Performing degree-normalised local aggregation** (`SpatialLocalAGG`) as
a fast, GCN-like first layer.
3. **Encoding location as a first-class feature**, enabling the model to
distinguish a node's north neighbours from its south neighbours — a distinction
invisible to standard GNNs.
| Model | Spatial awareness | Multi-hop | Framework |
|-------|------------------|-----------|-----------|
| GraphSAGE (existing) | None | Yes | PyTorch |
| **SA-GNN (this proposal)** | **Directional sectors** | **Yes** | **PaddlePaddle / PGL** |
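
The directional partitioning described in item 1 can be illustrated with a short sketch. This is not the PaddleSpatial implementation: the function names `sector_index` and `partition_by_sector`, the choice of east as sector 0, and the anticlockwise ordering are assumptions made for the example.

```python
import math


def sector_index(center, neighbour, num_sectors=8):
    """Return the sector (0..num_sectors-1) of `neighbour` as seen from `center`.

    Assumption: sector 0 is centred on east and sectors proceed anticlockwise.
    A neighbour with the same coordinates as the centre lands in sector 0.
    """
    dx = neighbour[0] - center[0]
    dy = neighbour[1] - center[1]
    full_turn = 2 * math.pi
    angle = math.atan2(dy, dx) % full_turn  # normalise to [0, 2*pi)
    width = full_turn / num_sectors
    # Shift by half a sector so each compass direction sits mid-sector.
    return int(((angle + width / 2) % full_turn) // width)


def partition_by_sector(center, neighbours, num_sectors=8):
    """Group neighbour indices by directional sector."""
    buckets = {sector: [] for sector in range(num_sectors)}
    for idx, coord in enumerate(neighbours):
        buckets[sector_index(center, coord, num_sectors)].append(idx)
    return buckets
```

Each bucket would then be aggregated independently, which is what lets the model tell a northern neighbour from a southern one.
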
### 1.3 Design Goals
1. **G-1** Add first-class PaddlePaddle support to `geaflow-infer` without
breaking existing PyTorch workflows.
2. **G-2** Introduce a framework-agnostic inference session abstraction
(`BaseInferSession`) so future frameworks (TensorFlow, MindSpore, etc.) can be
plugged in with minimal effort.
3. **G-3** Implement the SA-GNN algorithm from PaddleSpatial as a built-in
GQL graph algorithm (`SAGNN`), callable via `CALL SAGNN(...)`.
4. **G-4** Allow flexible Python runtime selection: users may use a
Miniconda-managed virtual environment **or** a system-installed Python (useful
for local development and CI).
5. **G-5** All changes must be backward compatible: existing jobs using
`geaflow.infer.framework.type=TORCH` (or unset) must continue to work without
modification.
### 1.4 Non-Goals
- Replacing or deprecating PyTorch support.
- Supporting PaddlePaddle static-graph inference (Paddle Inference /
TensorRT) in this proposal — only dynamic eager mode is shipped initially.
- Training models inside GeaFlow; this proposal is for **inference only**.
- Multi-GPU distributed inference inside a single Python worker process.
---
## 2. Constraints
### 2.1 Architectural Constraints
**C-1: The Python inference sub-process must remain a single-threaded worker
loop.**
`infer_server.py` runs a tight `while True` loop reading from shared memory
and writing results back. All session implementations must be safe to call from
this loop and must not spawn additional threads.
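
A minimal sketch of what C-1 implies for a session implementation. The `deque` objects stand in for the shared-memory queues, and `SumSession`, `serve`, and the thread-count guard are hypothetical names introduced here, not part of `infer_server.py`.

```python
import threading
from collections import deque


class SumSession:
    """Hypothetical session standing in for a Torch or Paddle session."""

    def run(self, *inputs):
        return sum(inputs)


def serve(session, input_queue, output_queue):
    """Drain input_queue on the calling thread, one request at a time."""
    baseline = threading.active_count()
    while input_queue:  # the real loop is `while True` over shared memory
        inputs = input_queue.popleft()
        output_queue.append(session.run(*inputs))
        # C-1 guard: a conforming session leaves the thread count unchanged.
        if threading.active_count() > baseline:
            raise RuntimeError("session spawned a thread, violating C-1")


requests = deque([(1, 2), (3, 4, 5)])
results = deque()
serve(SumSession(), requests, results)
```
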
**C-2: Data crossing the Java–Python bridge must be pickle-serialisable.**
The `PicklerDataBridger` uses Python's `pickle` protocol. `paddle.Tensor` is
**not** directly picklable; all tensor data must be converted to numpy arrays
or Python-native types before crossing the bridge. This is enforced by
`PaddleInferSession._coerce_to_native()`.
**C-3: The `TransFormFunction` interface contract must not change.**
Existing PyTorch UDFs implement `load_model()`, `transform_pre()`, and
`transform_post()`. The new `BaseInferSession` abstraction must accept any
conforming class without requiring UDF authors to change their code.
**C-4: The Java `InferContext` / `InferEnvironmentManager` lifecycle must
remain single-instance per JVM process.**
`InferEnvironmentManager` is a process-level singleton (guarded by
`AtomicBoolean INITIALIZED`). The Paddle path must work within this constraint.
### 2.2 Compatibility Constraints
**C-5: Default framework is TORCH — zero configuration change required for
existing jobs.**
The new config key `geaflow.infer.framework.type` defaults to `"TORCH"`. All
code paths that handle `null` or empty values fall back to `TORCH`.
**C-6: The `--tfClassName` CLI flag is kept for backward compatibility.**
`infer_server.py` still accepts `--tfClassName`; `--modelClassName` is the
new preferred alias. The resolution order is: `modelClassName` → `tfClassName`.
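
The resolution order in C-6 could look roughly like the following `argparse` sketch. Only the two flag names come from this proposal; `build_parser`, `resolve_class_name`, and the error raised when neither flag is given are assumptions.

```python
import argparse


def build_parser():
    """Illustrative subset of the infer_server.py command line."""
    parser = argparse.ArgumentParser(description="class-name flags only")
    parser.add_argument("--modelClassName", default=None,
                        help="preferred flag")
    parser.add_argument("--tfClassName", default=None,
                        help="legacy flag kept for backward compatibility")
    return parser


def resolve_class_name(argv):
    """Apply the order modelClassName -> tfClassName."""
    args = build_parser().parse_args(argv)
    name = args.modelClassName or args.tfClassName
    if not name:
        # Assumption: a missing class name is treated as a hard error.
        raise ValueError("pass --modelClassName (or the legacy --tfClassName)")
    return name
```
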
**C-7: System Python mode is opt-in.**
`geaflow.infer.env.use.system.python` defaults to `false`. When `false`, the
existing Miniconda virtual environment path is used unchanged.
---
## 3. Current State Analysis
### 3.1 Existing Inference Subsystem (before this change)
```
geaflow-infer/
├── InferEnvironmentManager.java   # Creates Miniconda venv, runs install shell
├── InferEnvironmentContext.java   # Holds paths (pythonExec, inferScript, libPath)
├── InferContext.java              # Drives Java→Python lifecycle
└── resources/infer/
    ├── env/install-infer-env.sh   # Shell: downloads Miniconda, installs requirements.txt
    └── inferRuntime/
        ├── infer_server.py        # Python entry point (hardcoded TorchInferSession)
        └── inferSession.py        # TorchInferSession (standalone, no base class)
```
**Problems identified:**
- `infer_server.py:56` instantiates `TorchInferSession` unconditionally — no
dispatch mechanism.
- `inferSession.py` has no abstract interface — copy-paste would be required
to add any new framework.
- `install-infer-env.sh` has no conditional branch for non-PyTorch
frameworks.
- `FrameworkConfigKeys` has no keys related to framework selection.
### 3.2 Existing GQL Graph Algorithms
GraphSAGE (`GraphSAGE.java`) is the only GNN algorithm registered in
`BuildInSqlFunctionTable`. It uses `InferContextPool` backed exclusively by
PyTorch. SA-GNN is a distinct algorithm family that cannot be expressed as a
configuration variant of GraphSAGE due to its direction-aware aggregation and
PGL graph primitives.
---
## 4. Design
### 4.1 Architecture Overview
```
GeaFlow Job (Java)
    │
    │  config: geaflow.infer.framework.type = PADDLE
    ▼
InferEnvironmentManager
    ├── (TORCH path)  → install-infer-env.sh [PyTorch] → inferFiles/
    └── (PADDLE path) → install-infer-env.sh [Paddle]  → inferFiles/
    │       install_paddlepaddle()
    │       install_requirements(requirements_paddle.txt)
    ▼
InferContext
    └── runs: python3 infer_server.py
            --modelClassName=SAGNNTransFormFunction
            --framework=PADDLE
            --input_queue_shm_id=...
            --output_queue_shm_id=...
infer_server.py
    └── _create_infer_session(framework="PADDLE", transform_class)
        └── PaddleInferSession(transform_class)   ← new
            │   inherits BaseInferSession         ← new
            ▼
SAGNNTransFormFunction (user UDF)
    │  uses PGL mini-graph
    │  calls SAGNNModel.forward()
    ▼
List[float] embedding → pickle bridge → Java List<Double>
```
### 4.2 Python Session Layer Refactor
A new abstract base class is introduced:
```python
# baseInferSession.py (new)
import abc

class BaseInferSession(abc.ABC):
    """Framework-agnostic contract shared by all inference sessions."""
    def __init__(self, transform_class): ...

    @abc.abstractmethod
    def run(self, *inputs): ...
```
Existing PyTorch session becomes a concrete subclass:
```python
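# inferSession.py (updated)
from baseInferSession import BaseInferSession

class TorchInferSession(BaseInferSession):
    def run(self, *inputs):
        # transform_pre returns a pair; only the first element feeds transform_post.
        pre_result, _aux = self._transform.transform_pre(*inputs)
        return self._transform.transform_post(pre_result)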
```
New PaddlePaddle session:
```python
# paddleInferSession.py (new)
from baseInferSession import BaseInferSession

class PaddleInferSession(BaseInferSession):
    def run(self, *inputs):
        pre_result, _aux = self._transform.transform_pre(*inputs)
        post_result = self._transform.transform_post(pre_result)
        return self._coerce_to_native(post_result)  # Tensor → list

    @staticmethod
    def _coerce_to_native(obj):
        # Recursively unwraps paddle.Tensor → Python list
        ...
```
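One way the elided `_coerce_to_native` body could work is sketched below. It is an assumption, not the code from this PR: it relies on duck typing (`numpy()` and `tolist()`, which `paddle.Tensor` and numpy arrays provide) so the example runs without Paddle installed, and `FakeTensor` is a hypothetical stand-in used only for the demonstration.

```python
import pickle


def coerce_to_native(obj):
    """Recursively convert tensor-like values into plain Python containers."""
    # Tensor-like objects expose numpy(); arrays and numpy scalars expose tolist().
    if hasattr(obj, "numpy") and callable(obj.numpy):
        obj = obj.numpy()
    if hasattr(obj, "tolist") and callable(obj.tolist):
        return obj.tolist()
    if isinstance(obj, dict):
        return {key: coerce_to_native(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples become lists in this sketch.
        return [coerce_to_native(item) for item in obj]
    return obj


class FakeArray:
    """Hypothetical stand-in for numpy.ndarray."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


class FakeTensor:
    """Hypothetical stand-in for paddle.Tensor."""

    def __init__(self, values):
        self._values = values

    def numpy(self):
        return FakeArray(self._values)


native = coerce_to_native({"embedding": FakeTensor([0.5, 1.5]), "shape": (1, 2)})
round_trip = pickle.loads(pickle.dumps(native))  # safe to cross the bridge
```
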
Framework dispatch in `infer_server.py`:
```python
def _create_infer_session(framework, transform_class):
    # Imports are deferred so that only the selected session module is loaded.
    if framework.upper() == "PADDLE":
        from paddleInferSession import PaddleInferSession
        return PaddleInferSession(transform_class)
    else:
        from inferSession import TorchInferSession
        return TorchInferSession(transform_class)
```
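As written, the factory calls `framework.upper()` directly, which raises if `framework` is `None`, and it silently maps any unrecognised value to PyTorch. A more defensive variant is sketched below. It is a design alternative rather than the code in this PR: the registry `_SESSIONS` and the helper names are hypothetical, and rejecting unknown values is a deliberate change from the fall-through behaviour above.

```python
import importlib

# Module and class names mirror the files named in this proposal.
_SESSIONS = {
    "TORCH": ("inferSession", "TorchInferSession"),
    "PADDLE": ("paddleInferSession", "PaddleInferSession"),
}


def normalise_framework(framework):
    """Map None, empty, or mixed-case input to a canonical name (default TORCH)."""
    name = (framework or "").strip().upper()
    return name or "TORCH"


def resolve_session_target(framework):
    """Return (module_name, class_name) or raise for an unknown framework."""
    name = normalise_framework(framework)
    if name not in _SESSIONS:
        raise ValueError(f"unsupported framework {framework!r}; "
                         f"expected one of {sorted(_SESSIONS)}")
    return _SESSIONS[name]


def create_infer_session(framework, transform_class):
    module_name, class_name = resolve_session_target(framework)
    # Lazy import: a Torch-only environment never needs Paddle installed.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(transform_class)
```
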
### 4.3 Environment Bootstrap
`install-infer-env.sh` receives three new arguments:
| Position | Name | Default |
|----------|------|---------|
| `$4` | `FRAMEWORK_TYPE` | `TORCH` |
| `$5` | `PADDLE_GPU_ENABLE` | `false` |
| `$6` | `PADDLE_CUDA_VERSION` | `11.7` |
When `FRAMEWORK_TYPE=PADDLE`, the script calls `install_paddlepaddle()`
**before** `install_requirements()`, because `pgl` and `paddlespatial` depend
on `paddlepaddle` being present at install time.
GPU wheel selection:
```bash
if [[ "${PADDLE_GPU_ENABLE}" == "true" ]]; then
    cuda_postfix=$(echo "${PADDLE_CUDA_VERSION}" | tr -d '.')  # "11.7" → "117"
    PADDLE_WHEEL="paddlepaddle-gpu==2.6.0.post${cuda_postfix}"
else
    PADDLE_WHEEL="paddlepaddle==2.6.0"
fi
```
```
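For readers who want to check the naming rule outside the shell script, the same logic can be mirrored in Python. The function name `paddle_wheel` and its parameters are hypothetical. The sketch only builds the requirement string; it does not verify that a wheel with that suffix is actually published for a given CUDA version.

```python
def paddle_wheel(gpu_enable, cuda_version="11.7", paddle_version="2.6.0"):
    """Build the pip requirement string the shell snippet would produce."""
    if gpu_enable:
        cuda_postfix = cuda_version.replace(".", "")  # "11.7" -> "117"
        return f"paddlepaddle-gpu=={paddle_version}.post{cuda_postfix}"
    return f"paddlepaddle=={paddle_version}"
```
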
A dedicated `requirements_paddle.txt` is provided for PGL and PaddleSpatial
dependencies:
```
pgl>=2.2.4
paddlespatial>=0.1.0
numpy>=1.21.0,<2.0.0
scipy>=1.7.0
psutil>=5.9.0
```
### 4.4 Java Configuration Layer
`InferContext.runInferTask()` now passes `--framework` to the Python process:
```java
String frameworkType = config.getString(INFER_FRAMEWORK_TYPE);
if (frameworkType == null || frameworkType.isEmpty()) {
    frameworkType = "TORCH";  // default keeps existing PyTorch jobs working
}
runCommands.add(inferEnvironmentContext.getInferFrameworkParam(frameworkType));
```
`InferEnvironmentManager.createInferVirtualEnv()` forwards the new arguments
to the shell script:
```java
execParams.add(frameworkType.toUpperCase());
execParams.add(String.valueOf(paddleGpu));
execParams.add(cudaVersion);
```
A system Python bypass (`INFER_ENV_USE_SYSTEM_PYTHON = true`) skips
Miniconda installation entirely, using a pre-installed Python interpreter on
the host. This is useful for development environments and CI pipelines where
conda overhead is undesirable.
### 4.5 SA-GNN Algorithm (`SAGNN.java`)
The `SAGNN` class implements `AlgorithmUserFunction` and is callable via GQL:
```sql
CALL SAGNN([numSamples, [numLayers]]) YIELD (vid, embedding)
```
**Algorithm iterations:**
| Iteration | Action |
|-----------|--------|
| 1 | Each vertex samples up to `numSamples` neighbours and broadcasts its feature vector. |
| 2 | Collect received features into `neighbourFeatureCache`; re-broadcast own features. |
| 3 … `numLayers+1` | Invoke Python SA-GNN model with vertex features + cached neighbour features; store resulting embedding. |
**Feature vector convention:**
The last 2 elements of the vertex feature vector are treated as `(coord_x,
coord_y)` by the Python model. All preceding elements are semantic features. If
a vertex has fewer than 64 features, the vector is zero-padded; if more, it is
truncated to 64 dimensions.
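
The convention above leaves open whether the 64-dimension limit applies before or after the coordinates are split off. The sketch below assumes the coordinates are removed first and only the semantic part is padded or truncated. That ordering, the zero-coordinate fallback for very short vectors, and the function signature are all assumptions; only the name `_split_feat_coord` appears in this proposal.

```python
def split_feat_coord(vector, feat_dim=64, coord_dim=2):
    """Split a raw feature vector into (semantic_features, coordinates).

    Assumes coord_dim >= 1. The semantic part is zero-padded or truncated
    to exactly feat_dim values.
    """
    values = [float(v) for v in vector]
    if len(values) >= coord_dim:
        semantic, coord = values[:-coord_dim], values[-coord_dim:]
    else:
        # Assumption: too short to carry coordinates, so default to the origin.
        semantic, coord = values, [0.0] * coord_dim
    semantic = semantic[:feat_dim]
    semantic += [0.0] * (feat_dim - len(semantic))
    return semantic, coord
```
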
### 4.6 SA-GNN Python Model (`PaddleSpatialSAGNNTransFormFunctionUDF.py`)
The UDF implements the full SA-GNN architecture using PGL primitives:
```
SAGNNModel
├── SpatialLocalAGG (GCN-like, degree-normalised message passing)
├── SpatialOrientedAGG
│ ├── _partition_edges_by_sector() (angle-based directional bucketing)
│ └── 9 × SpatialLocalAGG sub-convolutions (8 sectors + 1 catch-all)
└── Linear projection (hidden_dim → output_dim)
```
**Mini-graph construction per inference call:**
- Node 0: the centre vertex
- Nodes 1..K: its sampled neighbours (K = `numSamples`)
- Edges: directed `(i → 0)` for i in 1..K (neighbours → centre for message
passing)
- Node features: `(num_nodes, input_dim)` float32 array
- Node coordinates: `(num_nodes, 2)` float32 array (passed via
`graph.node_feat['coord']`)
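
The construction above can be sketched with plain Python lists. A real implementation would presumably pass the same pieces to a PGL graph object with float32 arrays. The function name, the returned dictionary layout, and the `'feat'` key are assumptions. The self-loop for a vertex with no neighbours follows the fallback mentioned in the testing plan.

```python
def build_mini_graph(center_feat, center_coord, neighbour_feats, neighbour_coords):
    """Assemble the star-shaped mini-graph for one inference call."""
    if len(neighbour_feats) != len(neighbour_coords):
        raise ValueError("each neighbour needs both features and coordinates")
    k = len(neighbour_feats)
    # Node 0 is the centre vertex; nodes 1..K are its sampled neighbours.
    feats = [list(center_feat)] + [list(f) for f in neighbour_feats]
    coords = [list(center_coord)] + [list(c) for c in neighbour_coords]
    # Directed edges (i -> 0): messages flow from neighbours to the centre.
    edges = [(i, 0) for i in range(1, k + 1)]
    if not edges:
        edges = [(0, 0)]  # self-loop fallback for an isolated vertex
    return {
        "num_nodes": k + 1,
        "edges": edges,
        "node_feat": {"feat": feats, "coord": coords},
    }
```
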
---
## 5. Configuration Reference
### 5.1 New Configuration Keys
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `geaflow.infer.framework.type` | String | `TORCH` | Inference framework: `TORCH` or `PADDLE` |
| `geaflow.infer.env.paddle.gpu.enable` | Boolean | `false` | Install `paddlepaddle-gpu` instead of CPU-only |
| `geaflow.infer.env.paddle.cuda.version` | String | `11.7` | CUDA version for GPU wheel (e.g. `11.7`, `12.0`) |
| `geaflow.infer.env.use.system.python` | Boolean | `false` | Skip Miniconda, use system Python instead |
| `geaflow.infer.env.system.python.path` | String | _(none)_ | Absolute path to system Python (e.g. `/usr/bin/python3`) |
### 5.2 Existing Keys (unchanged, required for SAGNN)
| Key | Required value for SAGNN |
|-----|--------------------------|
| `geaflow.infer.env.enable` | `true` |
| `geaflow.infer.env.user.transform.classname` | `SAGNNTransFormFunction` |
| `geaflow.infer.env.conda.url` | miniconda installer URL (if not using system Python) |
### 5.3 Minimal Configuration Example
```properties
# Required
geaflow.infer.env.enable=true
geaflow.infer.framework.type=PADDLE
geaflow.infer.env.user.transform.classname=SAGNNTransFormFunction
geaflow.infer.env.conda.url=https://example.com/Miniconda3-latest-Linux-x86_64.sh
# Optional: GPU
geaflow.infer.env.paddle.gpu.enable=true
geaflow.infer.env.paddle.cuda.version=11.7
```
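The requirements in 5.1 and 5.2 can be checked mechanically before a job is submitted. The helper below is a hypothetical pre-flight check, not part of GeaFlow: the function names are invented for the example, and the rules encode only what the tables above state.

```python
def parse_properties(text):
    """Parse simple `key=value` lines, skipping blanks and `#` comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def is_true(props, key):
    return props.get(key, "false").strip().lower() == "true"


def check_sagnn_config(props):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if not is_true(props, "geaflow.infer.env.enable"):
        problems.append("geaflow.infer.env.enable must be true")
    framework = (props.get("geaflow.infer.framework.type") or "TORCH").upper()
    if framework != "PADDLE":
        problems.append("geaflow.infer.framework.type must be PADDLE for SAGNN")
    if not props.get("geaflow.infer.env.user.transform.classname"):
        problems.append("geaflow.infer.env.user.transform.classname is not set")
    uses_system_python = is_true(props, "geaflow.infer.env.use.system.python")
    if not uses_system_python and not props.get("geaflow.infer.env.conda.url"):
        problems.append("geaflow.infer.env.conda.url is required without system Python")
    return problems


EXAMPLE = """
# Required
geaflow.infer.env.enable=true
geaflow.infer.framework.type=PADDLE
geaflow.infer.env.user.transform.classname=SAGNNTransFormFunction
geaflow.infer.env.conda.url=https://example.com/Miniconda3-latest-Linux-x86_64.sh
"""
```
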
---
## 6. Changed Files
### 6.1 New Files
| File | Description |
|------|-------------|
| `geaflow-infer/…/inferRuntime/baseInferSession.py` | Abstract base class for all inference sessions |
| `geaflow-infer/…/inferRuntime/paddleInferSession.py` | PaddlePaddle inference session implementation |
| `geaflow-infer/…/inferRuntime/requirements_paddle.txt` | Pip requirements for PGL / PaddleSpatial |
| `geaflow-dsl-plan/…/udf/graph/SAGNN.java` | SA-GNN GQL algorithm UDF |
| `geaflow-dsl-plan/…/resources/PaddleSpatialSAGNNTransFormFunctionUDF.py` | Reference user UDF for SA-GNN |
### 6.2 Modified Files
| File | Change summary |
|------|---------------|
| `FrameworkConfigKeys.java` | Added 5 new config keys (`INFER_FRAMEWORK_TYPE`, `INFER_ENV_PADDLE_*`, `INFER_ENV_USE_SYSTEM_PYTHON`, `INFER_ENV_SYSTEM_PYTHON_PATH`) |
| `InferContext.java` | Pass `--framework` to Python sub-process; improve env-ready polling with `ScheduledExecutorService` |
| `InferEnvironmentContext.java` | Support system Python path resolution; add `getInferFrameworkParam()` and `getInferModelClassNameParam()` |
| `InferEnvironmentManager.java` | Add `constructSystemPythonEnvironment()`; forward Paddle args to shell script |
| `install-infer-env.sh` | Add `install_paddlepaddle()` function; accept `$4–$6` args; branch on `FRAMEWORK_TYPE` |
| `infer_server.py` | Add `--framework` CLI arg; implement `_create_infer_session()` factory; accept `--modelClassName` alias |
| `inferSession.py` | Refactor `TorchInferSession` to extend `BaseInferSession` |
| `BuildInSqlFunctionTable.java` | Register `SAGNN` as a built-in graph algorithm |
---
## 7. API / GQL Changes
### 7.1 New GQL Syntax
```sql
-- Basic usage (10 neighbours, 2 layers by default)
CALL SAGNN() YIELD (vid, embedding)
-- With custom parameters
CALL SAGNN(20, 3) YIELD (vid, embedding)
-- Parameters:
-- arg[0]: numSamples (int, default 10) — neighbours to sample per vertex
-- arg[1]: numLayers (int, default 2) — number of SA-GNN layers
```
Output schema: `(vid ANY, embedding STRING)` — embedding is a
JSON-serialised `List<Double>`.
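
Because the embedding column is a string, a downstream consumer has to decode it before use. A minimal sketch, assuming the string is a plain JSON array of numbers as described above; `parse_embedding` is a hypothetical helper name.

```python
import json


def parse_embedding(embedding):
    """Decode the JSON-serialised embedding column into a list of floats."""
    values = json.loads(embedding)
    if not isinstance(values, list):
        raise ValueError("embedding must be a JSON array")
    return [float(v) for v in values]
```
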
### 7.2 TransFormFunction Interface (unchanged)
Existing UDFs are not affected. The three-method contract (`load_model`,
`transform_pre`, `transform_post`) remains the extension point for both
frameworks:
```python
import abc

class TransFormFunction(abc.ABC):
    def __init__(self, input_size: int): ...
    def load_model(self, *args): ...
    def transform_pre(self, *args): ...
    def transform_post(self, *args): ...
```
---
## 8. Testing Plan
### 8.1 Unit Tests
- `PaddleInferSession._coerce_to_native()`: verify `paddle.Tensor`, nested
lists, dicts, and scalars are all correctly coerced to Python-native types.
- `SAGNNTransFormFunction._split_feat_coord()`: verify padding, truncation,
and edge cases (empty vector, vector shorter than `coord_dim`).
- `SAGNNTransFormFunction._build_mini_graph()`: verify graph has correct
node count, edge directions, and self-loop fallback for isolated nodes.
- `InferEnvironmentContext`: verify system Python path detection for
`/opt/homebrew/bin/python3`, `/usr/bin/python3`.
### 8.2 Integration Tests
| Test class | Coverage |
|-----------|----------|
| `SAGNNAlgorithmTest` | End-to-end GQL query with mock Python process |
| `SAGNNInferIntegrationTest` | Full Java→Python→Java round-trip using system Python and a randomly-initialised SA-GNN model |
Test data files:
- `data/sagnn_vertex.txt`: vertex features including 2 coordinate columns at
the end
- `data/sagnn_edge.txt`: edge list
- `expect/gql_sagnn_001.txt`, `expect/gql_sagnn_002.txt`: expected output
snapshots
---
## 9. Rollout Plan
1. **Phase 1 (this PR)**: Merge all framework-layer changes + SAGNN
algorithm + `baseInferSession` abstraction.
2. **Phase 2 (follow-up)**: Add `paddlespatial` static inference mode
(`paddle.inference.create_predictor`) for production performance tuning.
3. **Phase 3 (future)**: Generalise the framework dispatch to support
additional backends (TensorFlow Lite, ONNX Runtime) using the same
`BaseInferSession` interface.
---
## 10. References
- [PaddleSpatial GitHub](https://github.com/PaddlePaddle/PaddleSpatial)
- [PGL (Paddle Graph Learning)](https://github.com/PaddlePaddle/PGL)
- SA-GNN paper: *Spatial Adaptive Graph Neural Network for Location-Aware
Services* (Baidu Research)
- GeaFlow-Infer design doc: `geaflow/geaflow-infer/README.md`
- Related issue: GeaFlow GraphSAGE integration (for design reference)