Hello Gao,

Thank you for your email and for the effort you have put in so far. It is good to see that you have gone beyond the idea stage and started making concrete progress on the integration side.

Your proposed direction is interesting, and the selective schema injection approach sounds reasonable as a way to control prompt size and improve relevance. The emphasis on dataset retrieval, column pruning, and value hints is sensible for narrowing context before query generation.

One important point to keep in mind is that this project should not be treated as only a schema-to-query problem. AsterixDB supports a large number of built-in functions, including many specialized SQL++ and spatial functions. For an NL2SQL++ assistant to be genuinely useful, it should be aware not only of datasets and fields, but also of the available functions, what inputs they expect, what they semantically do, and what kind of outputs they produce. That functional knowledge is important if the system is expected to generate meaningful queries beyond simple filter/project patterns.

For the formal proposal, it would be useful to see:
- a clearly scoped MVP,
- the end-to-end architecture,
- how schema and function knowledge will be represented and retrieved,
- how generated queries will be validated or repaired/improved,
- and a concrete evaluation plan with example query classes and success criteria.

It would also help to discuss limitations and failure cases early, especially around ambiguity, unsupported intents, and safety of executing generated queries.

Overall, this looks like a promising start.

Best regards,
Suryaa

On Thu, Mar 26, 2026 at 8:19 PM Gao, Tianyang via dev <[email protected]> wrote:

> Hi AsterixDB community,
>
> My name is Tianyang Gao and I am a Master student at Georgia Tech
> interested in contributing to Apache AsterixDB as part of GSoC 2026. I
> would like to apply for the NL2SQL++ project which adds natural language
> query support to AsterixDB via LangChain4j and a RAG-based schema injection
> pipeline.
>
> About me:
>
> I have been studying the AsterixDB codebase for the past few weeks
> focusing on the query processing pipeline, the Hyracks HTTP server
> infrastructure and the metadata system. I have already made an initial
> contribution by bootstrapping the asterix spidersilk module. This includes
> an IApiServerRegistrant ServiceLoader extension point and a skeleton
> NL2SqlServlet registered on the JSON API server (port 19002). A JIRA ticket
> will be filed shortly.
> My coursework in Database Systems (CS6400) at Georgia Tech gave me a solid
> foundation in SQL, schema design and metadata management. This background
> is directly applicable to the schema extraction pipeline and to
> understanding the semantic gap between natural language and structured
> queries that this project aims to bridge.
>
> As a side project I built a YouTube-style video streaming website in Java
> (available at https://github.com/pineappleBest123/imooc-youtube). This
> gave me hands-on experience with Java web backends, REST API design and
> working with large media datasets.
>
> My proposed approach:
>
> Rather than injecting the full schema into every LLM prompt I plan to
> implement a three-layer selective schema injection strategy inspired by
> CHESS and DAIL-SQL:
> 1. Dataset retrieval via RAG (LangChain4j EmbeddingStoreContentRetriever)
> 2. Column pruning via TF-IDF keyword scoring and semantic similarity
> 3. Value hints by sampling actual field values from AsterixDB
>
> The system will support OpenAI and local Ollama models configurable via
> cc.conf with multi-turn conversation context managed through LangChain4j's
> TokenWindowChatMemory.
>
> I would love to get feedback on this direction and connect with the
> mentors for this project. Is there anything specific you would like to see
> in the formal GSoC proposal?
>
> Thank you for your time.
> Best regards,
> Tianyang Gao
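
As a rough illustration of the column-pruning layer in the quoted proposal (step 2), keyword scoring over column metadata might look like the sketch below. Everything in it is an assumption made for illustration: the ColumnPruner class, the topColumns method, the stopword list, and the column descriptions are hypothetical and are not part of AsterixDB, LangChain4j, or the proposal itself. It covers only the TF-IDF keyword half; the semantic-similarity half would need an embedding model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not an AsterixDB or LangChain4j class.
final class ColumnPruner {

    // Tiny illustrative stopword list; a real system would use a fuller one.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "of", "in", "at", "is", "was"));

    // Splits camelCase and snake_case, lower-cases, and drops stopwords.
    static List<String> tokenize(String text) {
        String spaced = text.replaceAll("([a-z0-9])([A-Z])", "$1 $2").toLowerCase();
        List<String> tokens = new ArrayList<>();
        for (String t : spaced.split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // columns maps a column name to a free-text description (assumed metadata).
    // Returns up to k column names with a non-zero TF-IDF score, best first.
    static List<String> topColumns(String question, Map<String, String> columns, int k) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map.Entry<String, String> e : columns.entrySet()) {
            // Each column is one "document": its name plus its description.
            List<String> tokens = tokenize(e.getKey() + " " + e.getValue());
            docs.put(e.getKey(), tokens);
            for (String t : new HashSet<>(tokens)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }
        int n = docs.size();
        Set<String> queryTerms = new HashSet<>(tokenize(question));
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            double score = 0.0;
            for (String q : queryTerms) {
                long tf = e.getValue().stream().filter(q::equals).count();
                if (tf > 0) {
                    // Smoothed inverse document frequency.
                    double idf = Math.log((n + 1.0) / (docFreq.get(q) + 1.0)) + 1.0;
                    score += tf * idf;
                }
            }
            scores.put(e.getKey(), score);
        }
        return scores.entrySet().stream()
                .filter(e -> e.getValue() > 0.0)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up column metadata for a toy tweets dataset.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("user_id", "identifier of the user");
        columns.put("tweet_text", "body text of the tweet");
        columns.put("created_at", "time the tweet was created");
        columns.put("location", "spatial point where the tweet was sent");
        System.out.println(
                topColumns("Show the text of tweets created last week", columns, 2));
    }
}
```

On this toy input only tweet_text and created_at score above zero. The sketch does no stemming, so "tweets" in the question does not match "tweet" in the descriptions, which is one reason the proposal's pairing of keyword scoring with semantic similarity seems sensible.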
