Hi AsterixDB community,

My name is Tianyang Gao and I am a Master's student at Georgia Tech interested
in contributing to Apache AsterixDB as part of GSoC 2026. I would like to apply
for the NL2SQL++ project, which adds natural language query support to
AsterixDB via LangChain4j and a RAG-based schema injection pipeline.

About me:

I have been studying the AsterixDB codebase for the past few weeks, focusing on
the query processing pipeline, the Hyracks HTTP server infrastructure, and the
metadata system. I have already made an initial contribution by bootstrapping
the asterix spidersilk module. This includes an IApiServerRegistrant
ServiceLoader extension point and a skeleton NL2SqlServlet registered on the
JSON API server (port 19002). A JIRA ticket will be filed shortly.

My coursework in Database Systems (CS6400) at Georgia Tech gave me a solid
foundation in SQL, schema design and metadata management. This background is 
directly applicable to the schema extraction pipeline and to understanding the 
semantic gap between natural language and structured queries that this project 
aims to bridge.

As a side project, I built a YouTube-style video streaming website in Java
(available at https://github.com/pineappleBest123/imooc-youtube). This gave me
hands-on experience with Java web backends, REST API design, and working with
large media datasets.

My proposed approach:

Rather than injecting the full schema into every LLM prompt, I plan to
implement a three-layer selective schema injection strategy inspired by CHESS
and DAIL-SQL:
1. Dataset retrieval via RAG (LangChain4j EmbeddingStoreContentRetriever)
2. Column pruning via TF-IDF keyword scoring and semantic similarity
3. Value hints by sampling actual field values from AsterixDB
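
To make layer 2 a bit more concrete, here is a rough sketch of the TF-IDF
keyword-scoring half only, in plain Java with no dependencies. It is an
illustration under assumptions, not code from the module: the class name
ColumnPruner, its methods, and the sample columns and descriptions are all
hypothetical, and the semantic-similarity half is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: ranks columns by TF-IDF overlap with the question.
final class ColumnPruner {

    // Split camelCase and non-alphanumeric boundaries, then lowercase.
    static List<String> tokenize(String text) {
        String spaced = text.replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        List<String> tokens = new ArrayList<>();
        for (String t : spaced.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // columns maps a column name to a short description of that column.
    // Returns up to k column names with a non-zero score, best first.
    static List<String> topColumns(String question, Map<String, String> columns, int k) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map.Entry<String, String> e : columns.entrySet()) {
            List<String> tokens = tokenize(e.getKey() + " " + e.getValue());
            docs.put(e.getKey(), tokens);
            for (String t : new HashSet<>(tokens)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }
        int n = docs.size();
        Set<String> queryTerms = new HashSet<>(tokenize(question));
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            List<String> tokens = e.getValue();
            double score = 0.0;
            for (String q : queryTerms) {
                long tf = tokens.stream().filter(q::equals).count();
                if (tf == 0) {
                    continue;
                }
                // Smoothed IDF: terms found in fewer columns weigh more.
                double idf = Math.log((1.0 + n) / (1.0 + docFreq.get(q))) + 1.0;
                score += ((double) tf / tokens.size()) * idf;
            }
            scores.put(e.getKey(), score);
        }
        return scores.entrySet().stream()
                .filter(e -> e.getValue() > 0.0)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample schema, for illustration only.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("userName", "name of the user");
        columns.put("signupDate", "date the user registered");
        columns.put("tweetText", "text body of the tweet");
        columns.put("retweetCount", "number of retweets");
        System.out.println(topColumns("user signup date", columns, 2));
    }
}
```

Columns that share no keyword with the question score zero and are dropped
before the prompt is built. In the real pipeline this score would presumably be
combined with an embedding-based similarity score rather than used on its own.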

The system will support OpenAI and local Ollama models, configurable via
cc.conf, with multi-turn conversation context managed through LangChain4j's
TokenWindowChatMemory.

I would love to get feedback on this direction and connect with the mentors for 
this project. Is there anything specific you would like to see in the formal 
GSoC proposal?

Thank you for your time.
Best regards,
Tianyang Gao
