Hi AsterixDB community,

My name is Tianyang Gao, and I am a Master's student at Georgia Tech interested in contributing to Apache AsterixDB as part of GSoC 2026. I would like to apply for the NL2SQL++ project, which adds natural language query support to AsterixDB via LangChain4j and a RAG-based schema injection pipeline.

About me:

I have been studying the AsterixDB codebase for the past few weeks, focusing on the query processing pipeline, the Hyracks HTTP server infrastructure, and the metadata system. I have already made an initial contribution by bootstrapping the asterix spidersilk module. This includes an IApiServerRegistrant ServiceLoader extension point and a skeleton NL2SqlServlet registered on the JSON API server (port 19002). A JIRA ticket will be filed shortly.

My coursework in Database Systems (CS6400) at Georgia Tech gave me a solid foundation in SQL, schema design, and metadata management. This background is directly applicable to the schema extraction pipeline and to understanding the semantic gap between natural language and structured queries that this project aims to bridge.

As a side project, I built a YouTube-style video streaming website in Java (available at https://github.com/pineappleBest123/imooc-youtube). This gave me hands-on experience with Java web backends, REST API design, and working with large media datasets.

My proposed approach:

Rather than injecting the full schema into every LLM prompt, I plan to implement a three-layer selective schema injection strategy inspired by CHESS and DAIL-SQL:

1. Dataset retrieval via RAG (LangChain4j EmbeddingStoreContentRetriever)
2. Column pruning via TF-IDF keyword scoring and semantic similarity
3. Value hints by sampling actual field values from AsterixDB

The system will support OpenAI and local Ollama models, configurable via cc.conf, with multi-turn conversation context managed through LangChain4j's TokenWindowChatMemory.

I would love to get feedback on this direction and connect with the mentors for this project. Is there anything specific you would like to see in the formal GSoC proposal?

Thank you for your time.

Best regards,
Tianyang Gao
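
P.S. To make the TF-IDF keyword scoring part of step 2 a bit more concrete, here is a rough sketch in plain Java. It is only an illustration under my own assumptions, not code from the module: the class name ColumnPruner, the method names, and the idea of treating each column's name plus a short description as one "document" are all hypothetical, and the semantic-similarity half of step 2 is left out. Stop words are not removed here; their smoothed IDF is simply low.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: ranks columns by TF-IDF overlap with the user's question.
final class ColumnPruner {

    // Lowercase and split on non-alphanumerics, so "order_total" -> [order, total].
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // columns maps column name -> short description (an assumed input format).
    // Returns at most k column names with a positive score, best first.
    static List<String> topColumns(String question, Map<String, String> columns, int k) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map.Entry<String, String> e : columns.entrySet()) {
            List<String> tokens = tokenize(e.getKey() + " " + e.getValue());
            docs.put(e.getKey(), tokens);
            // Count each term once per column for the document frequency.
            for (String t : new HashSet<>(tokens)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }

        Set<String> queryTerms = new HashSet<>(tokenize(question));
        int n = docs.size();
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            List<String> tokens = e.getValue();
            double score = 0.0;
            for (String term : queryTerms) {
                long tf = tokens.stream().filter(term::equals).count();
                if (tf == 0) {
                    continue;
                }
                // Smoothed IDF: terms that appear in every column still get weight 1.
                double idf = Math.log((1.0 + n) / (1.0 + docFreq.get(term))) + 1.0;
                score += ((double) tf / tokens.size()) * idf;
            }
            scores.put(e.getKey(), score);
        }

        return scores.entrySet().stream()
                .filter(e -> e.getValue() > 0)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For a question such as "What is the total price of each order?", a column named order_total with the description "total price of the order" would rank above unrelated columns, and only the top k would be passed on to the prompt. In the real pipeline I would expect to combine this score with embedding similarity rather than use it alone.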
