[PR] Gremlin Corpus Generation System Based on Recursive Backtracking [incubator-hugegraph-ai]

via GitHub Tue, 30 Sep 2025 07:18:26 -0700


LRriver opened a new pull request, #303:
URL: https://github.com/apache/incubator-hugegraph-ai/pull/303


   📋 **Project Overview**
   This PR adds a complete Text-to-Gremlin corpus generation system. It uses
   recipe-guided generation driven by recursive backtracking to automatically
   produce large-scale, diverse training data from Gremlin query templates.
   
   🏗️ **Project Structure**
   ```
   AST_Text2Gremlin/                   # Project root directory
   ├── base/                           # Core system directory
   │   ├── generator.py                # Main generator entry point
   │   ├── GremlinTransVisitor.py      # ANTLR syntax tree visitor
   │   ├── TraversalGenerator.py       # Recursive backtracking generator
   │   ├── Schema.py                   # Graph database Schema management
   │   ├── GremlinBase.py              # Base component library
   │   ├── Config.py                   # Configuration management
   │   ├── cypher2gremlin_dataset.csv  # Dataset of 3514 real queries
   │   └── test/                       # Test suite
   ├── config.json                     # Global configuration file
   ├── db_data/                        # Schema and data files
   └── README.md                       # Detailed technical documentation
   ```
   
   🎯 **Core Features**
   1. **Recipe-Guided Generation**
      - Parse Gremlin queries into Recipes using ANTLR
      - Perform intelligent parameter generalization based on Schema
      - Generate large numbers of valid variants through recursive backtracking
   
   2. **Large-Scale Data Processing**
      - Support batch loading of query templates from CSV files
      - Process the 3514 real entries in the cypher2gremlin dataset
      - Global deduplication to ensure corpus quality
   
   3. **Complete Error Handling**
      - Support complex query types (g.call(), .with(), etc.)
      - Individual failures don't affect overall processing
      - Detailed statistics and error reporting
   
   4. **Intelligent Constraint Mechanism**
      - Schema connectivity validation
      - Syntax validity checking
      - Combinatorial explosion control (roughly 320k candidate combinations reduced to 7k valid ones)
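
The generation and constraint steps listed above can be pictured with a minimal sketch. It is not the code from this PR: the toy schema, the `expand` function, and the restriction to `hasLabel(...)` followed by `out(...)` steps are all assumptions made for illustration. It shows how recursive backtracking can enumerate edge-label choices and prune any branch the schema does not connect.

```python
from typing import Dict, List, Tuple

# Hypothetical toy schema: (source vertex label, edge label) -> target vertex label.
SCHEMA_EDGES: Dict[Tuple[str, str], str] = {
    ("person", "knows"): "person",
    ("person", "created"): "software",
}
VERTEX_LABELS = ["person", "software"]
EDGE_LABELS = ["knows", "created"]


def expand(hops: int) -> List[str]:
    """Enumerate g.V().hasLabel(x).out(e1)...out(en) variants the schema allows."""
    results: List[str] = []

    def backtrack(current_label: str, steps: List[str], remaining: int) -> None:
        if remaining == 0:
            results.append("g.V()" + "".join(steps))
            return
        for edge in EDGE_LABELS:
            target = SCHEMA_EDGES.get((current_label, edge))
            if target is None:
                continue  # prune: this edge label never leaves current_label
            steps.append(f".out('{edge}')")          # make a choice
            backtrack(target, steps, remaining - 1)  # explore it
            steps.pop()                              # undo the choice

    for label in VERTEX_LABELS:
        backtrack(label, [f".hasLabel('{label}')"], hops)
    return results


for query in expand(2):
    print(query)
```

With this toy schema, two start labels and two edge labels over two hops give 8 unconstrained combinations, of which only 2 pass the connectivity check.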
   
   📊 **System Capabilities**
   - Query type support: V/E traversals, graph algorithm calls, complex filtering, etc.
   - Generation scale: A single complex template can generate 6000+ valid variants
   - Processing efficiency: Batch processing of 3514 templates with robust error handling
   - Output quality: JSON format with query-description pairs and detailed metadata
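
The batch behaviour described above (per-template error isolation, global deduplication, statistics, JSON output) might look roughly like the following sketch. Apart from the `total_unique_queries` key, which appears in the usage example in this description, every name here (`build_corpus`, the `expand` callback, the other result keys) is hypothetical and not taken from the PR.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple


def build_corpus(
    templates: Iterable[str],
    expand: Callable[[str], List[Tuple[str, str]]],
) -> Dict:
    """Expand each template, skipping failures and deduplicating globally."""
    seen = set()   # queries already emitted, shared across all templates
    pairs = []     # query-description records
    errors = []    # one entry per template that failed
    processed = 0
    for template in templates:
        processed += 1
        try:
            variants = expand(template)
        except Exception as exc:  # a single bad template must not stop the batch
            errors.append({"template": template, "error": str(exc)})
            continue
        for query, description in variants:
            if query in seen:
                continue  # global deduplication
            seen.add(query)
            pairs.append({"query": query, "description": description})
    return {
        "total_templates": processed,
        "failed_templates": len(errors),
        "total_unique_queries": len(pairs),
        "pairs": pairs,
        "errors": errors,
    }


def demo_expand(template: str) -> List[Tuple[str, str]]:
    """Stand-in for the real generator: returns the template plus one variant."""
    if "(" not in template:
        raise ValueError("not a Gremlin traversal")
    return [(template, "original"), (template + ".limit(10)", "first ten results")]


report = build_corpus(["g.V()", "oops", "g.V()"], demo_expand)
print(json.dumps(report, indent=2))
```

Collecting errors in the result, rather than raising, keeps one malformed template from discarding the work already done on the rest of the batch.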
   
   🧪 **Technical Features**
   - Recursive backtracking algorithm: Systematically explore the parameter combination space
   - Recipe abstraction: Structure queries into generalizable Recipes
   - Constraint optimization: Over 97% of candidate combinations are filtered out as invalid
   - Modular design: Core components can be used and tested independently
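
As a quick consistency check, the 97% figure can be related to the 320k and 7k figures quoted under Core Features. This assumes both describe the same measurement and treats the rounded values as exact, which the description does not confirm.

```python
# Rounded figures from the PR description, treated as exact for illustration.
candidate_combinations = 320_000
valid_combinations = 7_000

filtered_fraction = 1 - valid_combinations / candidate_combinations
print(f"Filtered out: {filtered_fraction:.1%}")
```

Under that assumption about 97.8% of candidates are rejected, which matches the "97%+" claim.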
   
   📈 **Application Value**
   - Text-to-Gremlin training: Provide large-scale training data for NLP models
   - Query diversity: Generate rich query variants from limited templates
   - Data quality: Ensure generated queries are syntactically correct and semantically plausible
   - Extensibility: New schemas and query types can be added
   
   🔧 **Usage**
   ```python
   # Basic usage
   from generator import generate_corpus_from_templates
   
   templates = ["g.V().hasLabel('person')", "g.V().out('knows')"]
   result = generate_corpus_from_templates(templates)
   print(f"Generated {result['total_unique_queries']} unique queries")
   ```
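
For batch loading from CSV, templates would first need to be read into a list. The sketch below is one way to do that with the standard library. The column name `gremlin_query` and the helper `load_templates` are assumptions: the layout of `cypher2gremlin_dataset.csv` is not described here.

```python
import csv
import io
from typing import List, TextIO


def load_templates(handle: TextIO, column: str = "gremlin_query") -> List[str]:
    """Read non-empty Gremlin templates from one column of a CSV file."""
    templates = []
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        if value:  # skip rows with no Gremlin translation
            templates.append(value)
    return templates


# In-memory sample standing in for a real file; column names are hypothetical.
sample = io.StringIO(
    "cypher_query,gremlin_query\n"
    "MATCH (n:person) RETURN n,g.V().hasLabel('person')\n"
    "\"MATCH (a)-[:knows]->(b) RETURN b\",\"g.V().out('knows')\"\n"
    "MATCH (x) RETURN x,\n"
)
templates = load_templates(sample)
print(templates)
```

The resulting list could then be passed to `generate_corpus_from_templates` as in the usage example above.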
   
   📋 **Documentation**
   - README.md: Quick start guide
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

