[Discuss][DFIP] GeaFlow NL2Cypher Extension

windwheel Fri, 17 Oct 2025 23:17:21 -0700
Hello everyone. With the surge of AI, more and more users are accustomed to 
asking questions in natural language. However, the current graph query 
languages, GQL and Cypher, still require users to learn a fair amount of 
syntax. I found that GeaFlow does not yet support NL2Cypher, so I would like 
to submit the following proposal for your consideration.
# GeaFlow NL2Cypher Extension 
## Abstract
GeaFlow NL2Cypher extends Apache GeaFlow's existing DSL capabilities to support 
natural language to Cypher query translation, enabling users to query graph 
databases using plain English instead of complex GQL syntax.
## Proposal
This proposal extends Apache GeaFlow's distributed streaming graph computing 
engine with Natural Language to Cypher (NL2Cypher) translation capabilities. 
The extension integrates with GeaFlow's existing DSL architecture to provide a 
seamless natural language interface for graph queries.
## Background
Graph databases require specialized query language knowledge, creating barriers 
for non-technical users. GeaFlow's current DSL uses a compiler-based approach 
with syntax analysis, semantic analysis, and code generation phases. 
The existing `GeaFlowDSLParser` provides the foundation for accepting 
natural language input.
## Technical Implementation
### Core Data Structures
**1. NL2Cypher Query Request Structure**

```java
public class NLQueryRequest {
    private String naturalLanguage;
    private String graphSchema;
    private Map<String, Object> context;
    private QueryOptions options;
}

public class CypherResponse {
    private String generatedCypher;
    private double confidence;
    private List<String> alternatives;
    private QueryValidationResult validation;
}
```
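The `confidence` and `alternatives` fields suggest that a caller decides what to do with a translation before running it. A minimal sketch of such a gate follows. `TranslationOutcome`, `NextStep`, `ConfidenceGate`, and the threshold are hypothetical names introduced only for illustration, and the sketch assumes confidence is reported in the range 0.0 to 1.0, which the proposal does not state.

```java
import java.util.Collections;
import java.util.List;

/** Simplified, hypothetical stand-in for the proposed CypherResponse. */
final class TranslationOutcome {
    final String generatedCypher;
    final double confidence;          // assumed to lie in [0.0, 1.0]
    final List<String> alternatives;

    TranslationOutcome(String generatedCypher, double confidence, List<String> alternatives) {
        this.generatedCypher = generatedCypher;
        this.confidence = confidence;
        this.alternatives = Collections.unmodifiableList(alternatives);
    }
}

/** What the caller does next with a translation (hypothetical). */
enum NextStep { EXECUTE, ASK_USER_TO_CONFIRM, ASK_USER_TO_CHOOSE }

final class ConfidenceGate {
    private final double threshold;

    ConfidenceGate(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0.0, 1.0]");
        }
        this.threshold = threshold;
    }

    NextStep decide(TranslationOutcome outcome) {
        // Confident translations run directly.
        if (outcome.confidence >= threshold) {
            return NextStep.EXECUTE;
        }
        // Otherwise hand control back to the user: pick among alternatives
        // if there are any, or confirm the single candidate if not.
        return outcome.alternatives.isEmpty()
                ? NextStep.ASK_USER_TO_CONFIRM
                : NextStep.ASK_USER_TO_CHOOSE;
    }
}
```

The threshold would likely need tuning per model and per graph schema.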
**2. Extended Parser Architecture**
Building on the existing `GeaFlowDSLParser` structure: 
```java
public class NL2CypherParser extends GeaFlowDSLParser {
    private final LLMInferenceEngine llmEngine;
    private final QueryValidator validator;

    public SqlNode parseNaturalLanguage(String naturalLanguage, GraphSchema schema) {
        // 1. Preprocess natural language input
        NLQueryContext context = preprocessQuery(naturalLanguage, schema);

        // 2. Generate Cypher using LLM
        String cypher = llmEngine.translateToCypher(context);

        // 3. Validate generated query
        ValidationResult result = validator.validate(cypher, schema);

        // 4. Parse to SqlNode using existing infrastructure
        return parseStatement(cypher);
    }
}
```
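The parser outline computes a validation result in step 3 but does not show how a failed validation stops step 4. One way to wire that check is sketched below, independent of GeaFlow. `NlPipelineSketch` is a hypothetical name, and the `Function` and `Predicate` parameters merely stand in for the proposed `LLMInferenceEngine` and `QueryValidator`, whose real interfaces are not defined in this proposal.

```java
import java.util.Objects;
import java.util.function.Function;
import java.util.function.Predicate;

/** Preprocess, translate, validate, then hand off (hypothetical sketch). */
final class NlPipelineSketch {
    private final Function<String, String> translator; // stand-in for the LLM engine
    private final Predicate<String> validator;         // stand-in for the query validator

    NlPipelineSketch(Function<String, String> translator, Predicate<String> validator) {
        this.translator = Objects.requireNonNull(translator, "translator");
        this.validator = Objects.requireNonNull(validator, "validator");
    }

    /** Returns Cypher text that passed validation, or throws. */
    String toValidatedCypher(String naturalLanguage) {
        if (naturalLanguage == null || naturalLanguage.trim().isEmpty()) {
            throw new IllegalArgumentException("question must not be empty");
        }
        // 1. Preprocess: collapse whitespace so equivalent inputs look alike.
        String normalized = naturalLanguage.trim().replaceAll("\\s+", " ");

        // 2. Translate with the injected engine.
        String cypher = translator.apply(normalized);

        // 3. Validate, and stop here if the generated query is rejected.
        if (cypher == null || !validator.test(cypher)) {
            throw new IllegalStateException("Generated query failed validation: " + cypher);
        }

        // 4. Only validated text reaches the existing parser.
        return cypher;
    }
}
```

Injecting the translator and validator also lets the pipeline be tested without a live model.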
**3. Integration with Existing Query Processing**
The implementation leverages GeaFlow's existing `QueryClient` architecture: 
```java
public class ExtendedQueryClient extends QueryClient {
    private final NL2CypherParser nlParser = new NL2CypherParser();

    public QueryResult executeNaturalLanguageQuery(String naturalLanguage, QueryContext context) {
        try {
            // Extract graph schema from context
            GraphSchema schema = extractGraphSchema(context);

            // Convert NL to SqlNode
            SqlNode sqlNode = nlParser.parseNaturalLanguage(naturalLanguage, schema);

            // Use existing execution pipeline
            return executeQuery(sqlNode, context);
        } catch (Exception e) {
            throw new GeaFlowDSLException("Error in NL query execution: " + naturalLanguage, e);
        }
    }
}
```
### Architecture Design
**Main Processing Pipeline:**

```mermaid
graph TB
    subgraph "Input Layer"
        NL["Natural Language Query"]
        Schema["Graph Schema Context"]
    end

    subgraph "NL2Cypher Module"
        Preprocessor["Query Preprocessor"]
        LLM["LLM Inference Engine"]
        Generator["Cypher Generator"]
        Validator["Query Validator"]
    end

    subgraph "Existing GeaFlow DSL"
        Parser["GeaFlowDSLParser"]
        Context["GQLContext"]
        Planner["Query Planner"]
    end

    NL --> Preprocessor
    Schema --> Preprocessor
    Preprocessor --> LLM
    LLM --> Generator
    Generator --> Validator
    Validator --> Parser
    Parser --> Context
    Context --> Planner
```
**4. Graph Schema Integration**
Leveraging existing `GeaFlowGraph` structure:
```java
public class NLQueryContext {
    private final String naturalLanguage;
    private final GraphRecordType graphType;
    private final Map<String, EntityInfo> entities;
    private final List<RelationshipInfo> relationships;

    public static NLQueryContext from(String query, GeaFlowGraph graph, RelDataTypeFactory typeFactory) {
        GraphRecordType graphType = (GraphRecordType) graph.getRowType(typeFactory);
        return new NLQueryContext(query, graphType, extractEntities(query), extractRelationships(query));
    }
}
```
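The proposal does not say how the schema held in `NLQueryContext` reaches the LLM. One plausible approach, assumed here rather than taken from the proposal, is to render vertex and edge types as plain text ahead of the question. The sketch below uses ordinary collections in place of `GraphRecordType`. `SchemaPromptSketch`, `render`, and the text layout are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Renders a graph schema and a question as LLM prompt text (hypothetical sketch). */
final class SchemaPromptSketch {
    private SchemaPromptSketch() {}

    /**
     * @param vertexProperties vertex label mapped to its property names
     * @param edges            triples of {sourceLabel, edgeLabel, targetLabel}
     * @param question         the user's natural language question
     */
    static String render(Map<String, List<String>> vertexProperties,
                         List<String[]> edges,
                         String question) {
        StringBuilder sb = new StringBuilder("Vertices:\n");
        // Sort labels so the same schema always yields the same prompt.
        for (Map.Entry<String, List<String>> e : new TreeMap<>(vertexProperties).entrySet()) {
            sb.append("- ").append(e.getKey())
              .append('(').append(String.join(", ", e.getValue())).append(")\n");
        }
        sb.append("Edges:\n");
        for (String[] edge : edges) {
            sb.append("- (").append(edge[0]).append(")-[:").append(edge[1])
              .append("]->(").append(edge[2]).append(")\n");
        }
        return sb.append("Question: ").append(question).toString();
    }
}
```

Deterministic ordering matters if translations are cached, since the prompt text may form part of the cache key.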
**5. Validation Integration**
Building on existing validation infrastructure:
```java
public class NL2CypherValidator extends GQLValidatorImpl {

    public ValidationResult validateGeneratedCypher(String cypher, GraphSchema schema) {
        try {
            // Parse generated Cypher
            SqlNode sqlNode = parser.parseStatement(cypher);

            // Validate using existing infrastructure
            SqlNode validated = validate(sqlNode);

            return ValidationResult.success(validated);
        } catch (Exception e) {
            return ValidationResult.failure(e.getMessage());
        }
    }
}
```
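Because the response structure carries `alternatives`, a failed validation need not end the request. The sketch below tries the primary candidate first and then each alternative, returning the first one the validator accepts. `CandidateFallback` and `firstValid` are hypothetical names, and the `Predicate` stands in for a call such as `validateGeneratedCypher`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Picks the first generated query that passes validation (hypothetical sketch). */
final class CandidateFallback {
    private CandidateFallback() {}

    static Optional<String> firstValid(String primary,
                                       List<String> alternatives,
                                       Predicate<String> isValid) {
        // Keep the primary candidate first so it wins whenever it is valid.
        List<String> candidates = new ArrayList<>();
        candidates.add(primary);
        candidates.addAll(alternatives);

        for (String candidate : candidates) {
            if (candidate != null && isValid.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        // No candidate validated; the caller decides how to report this.
        return Optional.empty();
    }
}
```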
### Module Interaction Sequence
```mermaid
sequenceDiagram
    participant User
    participant ExtendedQueryClient
    participant NL2CypherParser
    participant LLMEngine
    participant GeaFlowDSLParser
    participant QueryEngine

    User->>ExtendedQueryClient: executeNaturalLanguageQuery("Find John's friends")
    ExtendedQueryClient->>ExtendedQueryClient: extractGraphSchema(context)
    ExtendedQueryClient->>NL2CypherParser: parseNaturalLanguage(query, schema)
    NL2CypherParser->>NL2CypherParser: preprocessQuery(naturalLanguage, schema)
    NL2CypherParser->>LLMEngine: translateToCypher(context)
    LLMEngine-->>NL2CypherParser: "MATCH (a:Person {name:'John'})-[:KNOWS]-(b:Person) RETURN b"
    NL2CypherParser->>NL2CypherParser: validator.validate(cypher, schema)
    NL2CypherParser->>GeaFlowDSLParser: parseStatement(cypher)
    GeaFlowDSLParser-->>NL2CypherParser: SqlNode
    NL2CypherParser-->>ExtendedQueryClient: SqlNode
    ExtendedQueryClient->>QueryEngine: executeQuery(sqlNode, context)
    QueryEngine-->>User: QueryResult
```
## Implementation Plan
### Phase 1: Core Infrastructure (Weeks 1-2)
- Extend `GeaFlowDSLParser` with NL2Cypher capabilities
- Implement `NLQueryContext` and related data structures
- Create basic LLM integration framework

### Phase 2: Query Processing (Weeks 3-4)
- Implement natural language preprocessing
- Develop Cypher generation logic
- Integrate with existing validation pipeline

### Phase 3: Integration & Testing (Weeks 5-6)
- Extend `QueryClient` for natural language support
- Comprehensive testing with existing GQL test patterns
- Performance optimization and caching
## Current Status
### Meritocracy
This extension follows Apache GeaFlow's established development practices, 
building upon existing code review processes and contribution guidelines.

### Community
The extension leverages GeaFlow's active community while attracting new users 
from business intelligence and data science domains who need accessible graph 
analytics.

### Alignment
The extension aligns with Apache GeaFlow's mission: it uses the existing 
Apache Calcite integration and follows established DSL patterns.
## Known Risks
### Technical Risks
- **LLM Accuracy**: Mitigation through validation pipeline and confidence scoring
- **Performance Impact**: Addressed via caching and optimization strategies
- **Schema Complexity**: Handled through incremental feature rollout
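
The caching mitigation could take the form of a small least-recently-used cache in front of the LLM call, so that repeated questions skip inference. This is a sketch under stated assumptions: `TranslationCache` is a hypothetical name, the key is only the whitespace-normalized question, and a real deployment would probably also need the schema version in the key so that stale translations are not reused after a schema change.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Bounded LRU cache for question-to-Cypher translations (hypothetical sketch). */
final class TranslationCache {
    private final Map<String, String> entries;

    TranslationCache(final int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        // accessOrder=true keeps the least recently used entry at the head.
        this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Collapses whitespace; case is kept because property values may be case-sensitive. */
    static String normalize(String question) {
        return question.trim().replaceAll("\\s+", " ");
    }

    /** Returns the cached translation, calling the translator only on a miss. */
    synchronized String translate(String question, Function<String, String> translator) {
        String key = normalize(question);
        String cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        String cypher = translator.apply(question);
        if (cypher != null) {
            entries.put(key, cypher);
        }
        return cypher;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Holding the lock during translation keeps the sketch simple but serializes LLM calls; a production version would likely avoid that.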