[PR] ASTERIXDB-PR2: Add schema extraction pipeline for NL2SQL++ (SchemaContextBuilder) [asterixdb]

via GitHub Fri, 27 Mar 2026 16:18:15 -0700


pineappleBest123 opened a new pull request, #47:
URL: https://github.com/apache/asterixdb/pull/47


## Summary

Add the schema extraction pipeline for the GSoC 2026 NL2SQL++ project.

This patch builds on top of the servlet infrastructure introduced in
https://github.com/apache/asterixdb/pull/46.
    
### Changes

- `ColumnInfo`: field name, type string, and primary-key flag with prompt-ready `toDescriptionString()` output
- `DatasetSchema`: holds all columns, supports pruned column subset (for ColumnPruner in a later PR) and value hints (for ValueHintsSampler)
- `DatasetSchemaFormatter`: recursively converts ADM `IAType` objects to human-readable strings (supports nested records, arrays, multisets, nullable unions, depth limit of 4)
- `SchemaContextBuilder`: reads Dataset and type metadata from `MetadataManager`, builds a `SchemaContext` with one description string per Dataset, wrapped in a metadata transaction
- 13 unit tests covering all formatter rules and schema pipeline behavior
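
For readers who want a feel for the recursive formatting with a depth limit, here is a minimal, self-contained sketch. It is not the PR's `DatasetSchemaFormatter` and does not use the real `IAType` API: `TypeFormatterSketch`, `Type`, and `Kind` are hypothetical stand-ins. The `[T]` array form follows the example output in this description; the `{{T}}` multiset form, the `T?` nullable form, the `...` truncation marker, and the way depth is counted are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the PR's DatasetSchemaFormatter and not the real IAType API.
final class TypeFormatterSketch {

    static final int MAX_DEPTH = 4; // depth limit quoted in the PR description

    enum Kind { PRIMITIVE, RECORD, ARRAY, MULTISET, NULLABLE }

    /** Toy type model standing in for ADM types. */
    static final class Type {
        final Kind kind;
        final String name;              // set for PRIMITIVE only
        final Type item;                // set for ARRAY, MULTISET, NULLABLE
        final Map<String, Type> fields; // set for RECORD only

        private Type(Kind kind, String name, Type item, Map<String, Type> fields) {
            this.kind = kind;
            this.name = name;
            this.item = item;
            this.fields = fields;
        }

        static Type primitive(String name) { return new Type(Kind.PRIMITIVE, name, null, null); }
        static Type array(Type item) { return new Type(Kind.ARRAY, null, item, null); }
        static Type multiset(Type item) { return new Type(Kind.MULTISET, null, item, null); }
        static Type nullable(Type item) { return new Type(Kind.NULLABLE, null, item, null); }
        static Type record(Map<String, Type> fields) { return new Type(Kind.RECORD, null, null, fields); }
    }

    static String format(Type type) {
        return format(type, 0);
    }

    private static String format(Type type, int depth) {
        if (depth >= MAX_DEPTH) {
            return "..."; // assumed truncation marker once the limit is reached
        }
        switch (type.kind) {
            case PRIMITIVE:
                return type.name;
            case ARRAY:
                return "[" + format(type.item, depth + 1) + "]";
            case MULTISET:
                return "{{" + format(type.item, depth + 1) + "}}";
            case NULLABLE:
                // A nullable wrapper adds no nesting level in this sketch.
                return format(type.item, depth) + "?";
            case RECORD: {
                List<String> parts = new ArrayList<>();
                for (Map.Entry<String, Type> e : type.fields.entrySet()) {
                    parts.add(e.getKey() + ": " + format(e.getValue(), depth + 1));
                }
                return "{" + String.join(", ", parts) + "}";
            }
            default:
                throw new IllegalArgumentException("unknown kind: " + type.kind);
        }
    }

    public static void main(String[] args) {
        Map<String, Type> fields = new LinkedHashMap<>();
        fields.put("lat", Type.primitive("double"));
        fields.put("tags", Type.array(Type.primitive("string")));
        System.out.println(format(Type.record(fields)));                     // {lat: double, tags: [string]}
        System.out.println(format(Type.nullable(Type.primitive("string")))); // string?
    }
}
```

Capping the recursion keeps deeply nested types from inflating the prompt; in this sketch anything at nesting level 4 or deeper collapses to the marker.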
                                                                                
    
                                                                                
                                                                                
    
### Example output

Dataset TweetMessages (tweetid: int64 [PK], sender-location: any,
    send-time: datetime, referred-topics: [string], message-text: string,
    author-id: int64)
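
As a rough illustration of how a description line like the one above could be assembled, here is a small sketch. The `ColumnInfo` class below is a hypothetical stand-in that only mirrors the three fields listed in this description (name, type string, primary-key flag); `DatasetDescriptionSketch` and `describe` are made-up names, and the PR's actual classes may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; class and method names are illustrative, not the PR's code.
final class DatasetDescriptionSketch {

    /** Stand-in for a column: field name, type string, primary-key flag. */
    static final class ColumnInfo {
        final String name;
        final String type;
        final boolean primaryKey;

        ColumnInfo(String name, String type, boolean primaryKey) {
            this.name = name;
            this.type = type;
            this.primaryKey = primaryKey;
        }

        /** Renders "name: type", with a " [PK]" suffix for primary-key columns. */
        String toDescriptionString() {
            return name + ": " + type + (primaryKey ? " [PK]" : "");
        }
    }

    /** Joins the column descriptions into one line per dataset. */
    static String describe(String datasetName, List<ColumnInfo> columns) {
        List<String> parts = new ArrayList<>();
        for (ColumnInfo column : columns) {
            parts.add(column.toDescriptionString());
        }
        return "Dataset " + datasetName + " (" + String.join(", ", parts) + ")";
    }

    public static void main(String[] args) {
        List<ColumnInfo> columns = Arrays.asList(
                new ColumnInfo("tweetid", "int64", true),
                new ColumnInfo("sender-location", "any", false),
                new ColumnInfo("send-time", "datetime", false),
                new ColumnInfo("referred-topics", "[string]", false),
                new ColumnInfo("message-text", "string", false),
                new ColumnInfo("author-id", "int64", false));
        // Prints the TweetMessages description on a single line.
        System.out.println(describe("TweetMessages", columns));
    }
}
```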
                                                                                
    
                                                                                
                                                                                
    
### Testing

All unit tests pass: `mvn test -pl asterixdb/asterix-spidersilk`

