Hello everyone. With the surge of AI, more and more users are accustomed to
asking questions in natural language. However, the graph query languages GQL
and Cypher still require users to learn a fairly complex syntax. I found that
GeaFlow does not yet support NL2Cypher, so I would like to submit the following
proposal and hope that you will consider it.
# GeaFlow NL2Cypher Extension
## Abstract
GeaFlow NL2Cypher extends Apache GeaFlow's existing DSL capabilities to support
natural language to Cypher query translation, enabling users to query graph
databases using plain English instead of complex GQL syntax.
## Proposal
This proposal extends Apache GeaFlow's distributed streaming graph computing
engine with Natural Language to Cypher (NL2Cypher) translation capabilities.
The extension integrates with GeaFlow's existing DSL architecture to provide a
seamless natural language interface for graph queries.
## Background
Graph databases require specialized query language knowledge, creating barriers
for non-technical users. GeaFlow's current DSL uses a compiler-based approach
with syntax analysis, semantic analysis, and code generation phases.
The existing `GeaFlowDSLParser` provides the foundation for supporting
natural language input.
## Technical Implementation
### Core Data Structures
**1. NL2Cypher Query Request Structure**
```java
public class NLQueryRequest {
    private String naturalLanguage;
    private String graphSchema;
    private Map<String, Object> context;
    private QueryOptions options;
}

public class CypherResponse {
    private String generatedCypher;
    private double confidence;
    private List<String> alternatives;
    private QueryValidationResult validation;
}
```
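As an illustration of how a caller might use the `confidence`, `alternatives`, and `validation` fields, here is a minimal sketch. `SimpleCypherResponse`, `ResponsePolicy`, and the `0.8` threshold are assumptions made for this example and are not part of GeaFlow; the validation result is reduced to a boolean to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for CypherResponse
// (validation is reduced to a boolean for this sketch).
final class SimpleCypherResponse {
    final String generatedCypher;
    final double confidence;
    final List<String> alternatives;
    final boolean valid;

    SimpleCypherResponse(String generatedCypher, double confidence,
                         List<String> alternatives, boolean valid) {
        this.generatedCypher = generatedCypher;
        this.confidence = confidence;
        this.alternatives = alternatives;
        this.valid = valid;
    }
}

// Hypothetical policy deciding what to do with a translation result.
final class ResponsePolicy {
    // Assumed threshold; a real value would need tuning on actual queries.
    static final double MIN_CONFIDENCE = 0.8;

    // Run the query without user confirmation only if it validated
    // and the model reported enough confidence.
    static boolean canAutoExecute(SimpleCypherResponse r) {
        return r.valid && r.confidence >= MIN_CONFIDENCE;
    }

    // Otherwise, collect the queries to show the user for confirmation:
    // the primary query (if it validated) followed by the alternatives.
    static List<String> candidatesForReview(SimpleCypherResponse r) {
        List<String> out = new ArrayList<>();
        if (r.valid) {
            out.add(r.generatedCypher);
        }
        out.addAll(r.alternatives);
        return out;
    }
}
```

With this policy, a low-confidence translation is never executed silently; the user is shown the candidate queries instead.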
**2. Extended Parser Architecture**
Building on the existing `GeaFlowDSLParser` structure:
```java
public class NL2CypherParser extends GeaFlowDSLParser {
    private final LLMInferenceEngine llmEngine;
    private final QueryValidator validator;

    public SqlNode parseNaturalLanguage(String naturalLanguage, GraphSchema schema) {
        // 1. Preprocess natural language input
        NLQueryContext context = preprocessQuery(naturalLanguage, schema);
        // 2. Generate Cypher using LLM
        String cypher = llmEngine.translateToCypher(context);
        // 3. Validate generated query
        ValidationResult result = validator.validate(cypher, schema);
        // 4. Parse to SqlNode using existing infrastructure
        return parseStatement(cypher);
    }
}
```
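The preprocessing step (step 1) is not specified in detail yet. One plausible shape, sketched below, is to normalize the question and combine it with a textual summary of the graph schema into a prompt for the LLM. `PromptBuilder`, its method names, the map-based schema representation, and the prompt wording are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds an LLM prompt from a question and a schema summary.
final class PromptBuilder {

    // Trim the question and collapse runs of whitespace into single spaces.
    static String normalize(String question) {
        return question.trim().replaceAll("\\s+", " ");
    }

    // vertexLabels maps each vertex label to its property names;
    // edgeTypes lists the available edge types.
    static String build(String naturalLanguage,
                        Map<String, List<String>> vertexLabels,
                        List<String> edgeTypes) {
        StringBuilder sb = new StringBuilder();
        sb.append("Translate the question into a single Cypher query.\n");
        sb.append("Use only the labels, properties and edge types listed below.\n");
        sb.append("Vertex labels:\n");
        for (Map.Entry<String, List<String>> e : vertexLabels.entrySet()) {
            sb.append("- ").append(e.getKey())
              .append('(').append(String.join(", ", e.getValue())).append(")\n");
        }
        sb.append("Edge types:\n");
        for (String type : edgeTypes) {
            sb.append("- ").append(type).append('\n');
        }
        sb.append("Question: ").append(normalize(naturalLanguage)).append('\n');
        return sb.toString();
    }
}
```

Listing the schema in the prompt is intended to keep the generated query within labels and edge types that actually exist in the graph, which should reduce the number of queries rejected later by validation.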
**3. Integration with Existing Query Processing**
The implementation leverages GeaFlow's existing `QueryClient` architecture:
```java
public class ExtendedQueryClient extends QueryClient {
    private final NL2CypherParser nlParser = new NL2CypherParser();

    public QueryResult executeNaturalLanguageQuery(String naturalLanguage, QueryContext context) {
        try {
            // Extract graph schema from context
            GraphSchema schema = extractGraphSchema(context);
            // Convert NL to SqlNode
            SqlNode sqlNode = nlParser.parseNaturalLanguage(naturalLanguage, schema);
            // Use existing execution pipeline
            return executeQuery(sqlNode, context);
        } catch (Exception e) {
            throw new GeaFlowDSLException("Error in NL query execution: " + naturalLanguage, e);
        }
    }
}
```
### Architecture Design
**Main Processing Pipeline:**
```mermaid
graph TB
    subgraph "Input Layer"
        NL["Natural Language Query"]
        Schema["Graph Schema Context"]
    end
    subgraph "NL2Cypher Module"
        Preprocessor["Query Preprocessor"]
        LLM["LLM Inference Engine"]
        Generator["Cypher Generator"]
        Validator["Query Validator"]
    end
    subgraph "Existing GeaFlow DSL"
        Parser["GeaFlowDSLParser"]
        Context["GQLContext"]
        Planner["Query Planner"]
    end
    NL --> Preprocessor
    Schema --> Preprocessor
    Preprocessor --> LLM
    LLM --> Generator
    Generator --> Validator
    Validator --> Parser
    Parser --> Context
    Context --> Planner
```
**4. Graph Schema Integration**
Leveraging existing `GeaFlowGraph` structure:
```java
public class NLQueryContext {
    private final String naturalLanguage;
    private final GraphRecordType graphType;
    private final Map<String, EntityInfo> entities;
    private final List<RelationshipInfo> relationships;

    public static NLQueryContext from(String query, GeaFlowGraph graph,
                                      RelDataTypeFactory typeFactory) {
        GraphRecordType graphType = (GraphRecordType) graph.getRowType(typeFactory);
        return new NLQueryContext(query, graphType, extractEntities(query),
            extractRelationships(query));
    }
}
```
**5. Validation Integration**
Building on existing validation infrastructure:
```java
public class NL2CypherValidator extends GQLValidatorImpl {
    public ValidationResult validateGeneratedCypher(String cypher, GraphSchema schema) {
        try {
            // Parse generated Cypher
            SqlNode sqlNode = parser.parseStatement(cypher);
            // Validate using existing infrastructure
            SqlNode validated = validate(sqlNode);
            return ValidationResult.success(validated);
        } catch (Exception e) {
            return ValidationResult.failure(e.getMessage());
        }
    }
}
```
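One possible way to act on a failed validation, sketched below, is to feed the error message back to the LLM and retry a bounded number of times. This retry loop is an assumption of this sketch rather than something the design above commits to. `RetryingTranslator` is a hypothetical name, and the LLM and validator are modelled as plain functional interfaces so the example is self-contained.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical wrapper: translate, validate, and retry with error feedback.
final class RetryingTranslator {
    // (question, previous validation error or null) -> generated Cypher
    private final BiFunction<String, String, String> llm;
    // generated Cypher -> error message, or empty if the query is valid
    private final Function<String, Optional<String>> validator;
    private final int maxAttempts;

    RetryingTranslator(BiFunction<String, String, String> llm,
                       Function<String, Optional<String>> validator,
                       int maxAttempts) {
        this.llm = llm;
        this.validator = validator;
        this.maxAttempts = maxAttempts;
    }

    // Returns the first query that passes validation, or empty if every
    // attempt fails. Each retry passes the last error back to the LLM.
    Optional<String> translate(String question) {
        String feedback = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String cypher = llm.apply(question, feedback);
            Optional<String> error = validator.apply(cypher);
            if (!error.isPresent()) {
                return Optional.of(cypher);
            }
            feedback = error.get();
        }
        return Optional.empty();
    }
}
```

Bounding the number of attempts keeps latency and LLM cost predictable; when all attempts fail, the caller can report the last error to the user instead of executing anything.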
### Module Interaction Sequence
```mermaid
sequenceDiagram
    participant User
    participant ExtendedQueryClient
    participant NL2CypherParser
    participant LLMEngine
    participant GeaFlowDSLParser
    participant QueryEngine
    User->>ExtendedQueryClient: executeNaturalLanguageQuery("Find John's friends")
    ExtendedQueryClient->>ExtendedQueryClient: extractGraphSchema(context)
    ExtendedQueryClient->>NL2CypherParser: parseNaturalLanguage(query, schema)
    NL2CypherParser->>NL2CypherParser: preprocessQuery(naturalLanguage, schema)
    NL2CypherParser->>LLMEngine: translateToCypher(context)
    LLMEngine-->>NL2CypherParser: "MATCH (a:Person {name:'John'})-[:KNOWS]-(b:Person) RETURN b"
    NL2CypherParser->>NL2CypherParser: validator.validate(cypher, schema)
    NL2CypherParser->>GeaFlowDSLParser: parseStatement(cypher)
    GeaFlowDSLParser-->>NL2CypherParser: SqlNode
    NL2CypherParser-->>ExtendedQueryClient: SqlNode
    ExtendedQueryClient->>QueryEngine: executeQuery(sqlNode, context)
    QueryEngine-->>User: QueryResult
```
## Implementation Plan
### Phase 1: Core Infrastructure (Weeks 1-2)
- Extend `GeaFlowDSLParser` with NL2Cypher capabilities
- Implement `NLQueryContext` and related data structures
- Create basic LLM integration framework
### Phase 2: Query Processing (Weeks 3-4)
- Implement natural language preprocessing
- Develop Cypher generation logic
- Integrate with existing validation pipeline
### Phase 3: Integration & Testing (Weeks 5-6)
- Extend `QueryClient` for natural language support
- Comprehensive testing with existing GQL test patterns
- Performance optimization and caching
## Current Status
### Meritocracy
This extension follows Apache GeaFlow's established development
practices, building upon existing code review processes and contribution
guidelines.
### Community
The extension leverages GeaFlow's active community while
attracting new users from business intelligence and data science domains who
need accessible graph analytics.
### Alignment
The extension aligns with Apache GeaFlow's mission: it reuses the
existing Apache Calcite integration and follows established DSL patterns.
## Known Risks
### Technical Risks
- **LLM Accuracy**: Mitigation through validation pipeline and confidence scoring
- **Performance Impact**: Addressed via caching and optimization strategies
- **Schema Complexity**: Handled through incremental feature rollout
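
To make the caching mitigation more concrete, here is a minimal sketch of a bounded LRU cache for translations, keyed by a schema identifier plus the whitespace-normalized question. `TranslationCache`, the `schemaId` parameter, and the key format are assumptions made for illustration. The key keeps the original letter case because names inside a question (for example `John`) may be case-sensitive in the generated query.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded LRU cache for NL -> Cypher translations.
final class TranslationCache {
    private final Map<String, String> entries;

    TranslationCache(final int capacity) {
        // accessOrder = true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once capacity is exceeded.
        this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    // Cache key: schema identifier plus the question with whitespace collapsed.
    // Including the schema avoids reusing a query after the schema changes.
    static String key(String schemaId, String question) {
        return schemaId + "::" + question.trim().replaceAll("\\s+", " ");
    }

    // Returns the cached Cypher, or null if there is no entry.
    synchronized String get(String schemaId, String question) {
        return entries.get(key(schemaId, question));
    }

    synchronized void put(String schemaId, String question, String cypher) {
        entries.put(key(schemaId, question), cypher);
    }
}
```

A cache hit skips the LLM call entirely, which is likely the most expensive step in the pipeline; only validated translations should be stored.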