LRriver opened a new pull request, #303:
URL: https://github.com/apache/incubator-hugegraph-ai/pull/303
**Project Overview**
This PR adds a complete Text-to-Gremlin corpus generation system built on
recipe-guided generation with recursive backtracking. It automatically
generates large-scale, diverse training data from Gremlin query templates.
**Project Structure**
```
AST_Text2Gremlin/ # Project root directory
├── base/ # Core system directory
│   ├── generator.py # Main generator entry point
│   ├── GremlinTransVisitor.py # ANTLR syntax tree visitor
│   ├── TraversalGenerator.py # Recursive backtracking generator
│   ├── Schema.py # Graph database Schema management
│   ├── GremlinBase.py # Base component library
│   ├── Config.py # Configuration management
│   ├── cypher2gremlin_dataset.csv # 3514 real query dataset
│   └── test/ # Test suite
├── config.json # Global configuration file
├── db_data/ # Schema and data files
└── README.md # Detailed technical documentation
```
**Core Features**
1. **Recipe-Guided Generation**
- Parse Gremlin queries into Recipes using ANTLR
- Perform intelligent parameter generalization based on Schema
- Generate large numbers of valid variants through recursive backtracking
2. **Large-Scale Data Processing**
- Support batch loading of query templates from CSV files
- Process the 3514 real queries in the cypher2gremlin dataset
- Global deduplication to ensure corpus quality
3. **Complete Error Handling**
- Support complex query types (g.call(), .with(), etc.)
- Individual failures don't affect overall processing
- Detailed statistics and error reporting
4. **Intelligent Constraint Mechanism**
- Schema connectivity validation
- Syntax validity checking
- Combinatorial explosion control (320k → 7k valid combinations)
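The interplay of recursive backtracking and schema connectivity validation can be illustrated with a minimal sketch. It is not the code in `TraversalGenerator.py`; the toy schema, the `expand` function, and the restriction to `out()` steps are assumptions made only for illustration. Branches that the schema cannot support are pruned before they are expanded, which is how the search space shrinks.

```python
from typing import Dict, List, Tuple

# Hypothetical toy schema: (source vertex label, edge label) -> target vertex label.
SCHEMA_EDGES: Dict[Tuple[str, str], str] = {
    ("person", "knows"): "person",
    ("person", "created"): "software",
}


def expand(start_label: str, hops: int) -> List[str]:
    """Enumerate every schema-valid chain of exactly `hops` out() steps."""
    results: List[str] = []

    def backtrack(current_label: str, steps: List[str]) -> None:
        if len(steps) == hops:
            # A complete, valid combination: render it as a Gremlin string.
            results.append(f"g.V().hasLabel('{start_label}')" + "".join(steps))
            return
        for (src, edge), dst in SCHEMA_EDGES.items():
            if src != current_label:
                continue  # Prune: this edge cannot leave the current vertex label.
            steps.append(f".out('{edge}')")  # Choose.
            backtrack(dst, steps)            # Explore.
            steps.pop()                      # Undo the choice (backtrack).

    backtrack(start_label, [])
    return results


if __name__ == "__main__":
    for query in expand("person", 2):
        print(query)
```

With this toy schema, two edge labels over two hops give four unconstrained combinations, of which only the two that start with `knows` survive, because `software` has no outgoing edges.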
**System Capabilities**
- Query type support: V/E traversals, graph algorithm calls, complex
filtering, etc.
- Generation scale: A single complex template can generate 6000+ valid variants
- Processing efficiency: Batch processing of 3514 templates with robust
error handling
- Output quality: JSON format with query-description pairs and detailed
metadata
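Batch processing with failure isolation and global deduplication could look roughly like the sketch below. The `build_corpus` function, the `expand_template` callback, and every report key other than `total_unique_queries` (which appears in the usage example of this PR) are hypothetical names chosen for illustration, not the actual interface of `generator.py`.

```python
import json
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

# Assumed callback shape: template -> iterable of (query, description) pairs.
Expander = Callable[[str], Iterable[Tuple[str, str]]]


def build_corpus(templates: Sequence[str], expand_template: Expander) -> Dict:
    """Expand each template, skipping failures and deduplicating globally."""
    seen: Set[str] = set()
    pairs: List[Dict[str, str]] = []
    errors: List[Dict[str, str]] = []

    for template in templates:
        try:
            # Materialize inside the try so lazy expanders fail here too.
            variants = list(expand_template(template))
        except Exception as exc:
            # One bad template is recorded and skipped; the batch continues.
            errors.append({"template": template, "error": str(exc)})
            continue
        for query, description in variants:
            if query in seen:
                continue  # Global deduplication across all templates.
            seen.add(query)
            pairs.append({"query": query, "description": description})

    return {
        "total_templates": len(templates),
        "failed_templates": len(errors),
        "total_unique_queries": len(pairs),
        "pairs": pairs,
        "errors": errors,
    }


def to_json(report: Dict) -> str:
    """Serialize the report as JSON text."""
    return json.dumps(report, indent=2, ensure_ascii=False)
```

Keeping the deduplication set outside the per-template loop is what makes the deduplication global rather than per template.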
**Technical Features**
- Recursive backtracking algorithm: Systematically explore the parameter
combination space
- Recipe abstraction: Structure queries into generalizable Recipes
- Constraint optimization: Over 97% of combinations are filtered out as invalid
- Modular design: Core components can be used and tested independently
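To make the Recipe abstraction concrete, here is a minimal sketch of one possible representation. The `Step` and `Recipe` classes and their methods are hypothetical and are not taken from this PR; in the real system the Recipe is produced by the ANTLR visitor, whereas this sketch builds one by hand and treats every argument as a quoted string for simplicity.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Step:
    """One traversal step, e.g. hasLabel('person')."""
    name: str
    args: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Recipe:
    """An ordered sequence of steps that can be rendered back to Gremlin."""
    steps: Tuple[Step, ...]

    def render(self) -> str:
        parts = ["g"]
        for step in self.steps:
            # Simplification: all arguments are rendered as quoted strings.
            quoted = ", ".join(f"'{arg}'" for arg in step.args)
            parts.append(f".{step.name}({quoted})")
        return "".join(parts)

    def with_arg(self, step_index: int, new_arg: str) -> "Recipe":
        """Return a copy with one step's arguments swapped (generalization)."""
        steps = list(self.steps)
        steps[step_index] = replace(steps[step_index], args=(new_arg,))
        return Recipe(tuple(steps))


if __name__ == "__main__":
    recipe = Recipe((Step("V"), Step("hasLabel", ("person",))))
    print(recipe.render())
    print(recipe.with_arg(1, "software").render())
```

Because the structure is immutable, each generalized variant is a new `Recipe`, so a generator can branch on alternatives without corrupting the template it started from.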
**Application Value**
- Text-to-Gremlin training: Provide large-scale training data for NLP models
- Query diversity: Generate rich query variants from limited templates
- Data quality: Ensure syntactic correctness and semantic reasonableness of
generated queries
- Extensibility: Support extension to new schemas and query types
**Usage**
```python
# Basic usage
from generator import generate_corpus_from_templates
templates = ["g.V().hasLabel('person')", "g.V().out('knows')"]
result = generate_corpus_from_templates(templates)
print(f"Generated {result['total_unique_queries']} unique queries")
```
**Documentation**
- README.md: Quick start guide
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]