Dear Suryaa,

Thank you for the detailed feedback. I have revised the proposal accordingly 
and made concrete progress on several
fronts.

Addressing your feedback:

Function knowledge: I have added a dedicated fourth context layer (Task-6b)
covering AsterixDB's built-in functions. The system enumerates all registered
functions at runtime via FunctionCollection.getFunctionDescriptorFactories()
(which also covers geo/fuzzyjoin extensions loaded via ServiceLoader), builds
a semantic knowledge base with function signatures and descriptions, and
retrieves relevant functions per query using vector similarity search. This
ensures the system can generate queries beyond simple filter/project patterns.
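
To make the retrieval idea concrete, here is a minimal, self-contained sketch
(not AsterixDB or LangChain4j code): function descriptions are matched to the
question by token overlap, standing in for the embedding-based vector
similarity the proposal actually uses. The function names and descriptions
below are purely illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Hedged sketch: rank built-in functions against a natural-language question
// by lexical similarity over their descriptions. A real implementation would
// use LangChain4j embeddings; everything here is a toy stand-in.
public class FunctionRetriever {

    static Set<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                     .filter(t -> !t.isBlank())
                     .collect(Collectors.toSet());
    }

    // Jaccard similarity between two token sets, a cheap stand-in for
    // cosine similarity over embedding vectors.
    static double score(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Return the k function names whose descriptions best match the question.
    static List<String> topK(Map<String, String> functionDocs,
                             String question, int k) {
        Set<String> q = tokens(question);
        return functionDocs.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, String> e) ->
                    score(q, tokens(e.getValue()))).reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "st_distance", "distance between two spatial points",
            "array_count", "number of items in an array",
            "lower", "convert a string to lowercase");
        System.out.println(topK(docs, "how far apart are two points", 1));
        // prints [st_distance]
    }
}
```
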
MVP scope & architecture: The proposal now includes a clearly scoped MVP
(Tasks 1–8) covering schema RAG, four-layer context injection, multi-provider
LLM support, and a two-stage validator with auto-retry. The end-to-end
architecture diagram shows how each component interacts.
Validation & repair: Task-8 introduces a two-stage validation pipeline: a
syntax validator using AsterixDB's own SQL++ parser, followed by an
intent-alignment checker that verifies structural consistency (dataset
coverage, aggregation presence, filter presence) before feeding error context
back into the retry loop.
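
As a rough illustration of that validate-and-retry flow, the sketch below
uses a stand-in syntax check and a pluggable generator function; the real
Task-8 pipeline would call AsterixDB's SQL++ parser and the LLM instead, so
none of the names here are actual project APIs.

```java
import java.util.*;
import java.util.function.*;

// Hedged sketch of a two-stage validate-and-retry loop. Stage 1 is a toy
// syntax check (the proposal uses the real SQL++ parser); stage 2 checks
// structural intent; failures feed error context back to the generator.
public class QueryValidator {

    // Stage 1: stand-in syntax check.
    static Optional<String> syntaxError(String sql) {
        if (!sql.toUpperCase().contains("SELECT")) {
            return Optional.of("syntax: missing SELECT clause");
        }
        return Optional.empty();
    }

    // Stage 2: structural intent check, e.g. every requested dataset
    // actually appears in the generated query.
    static Optional<String> intentError(String sql, Set<String> datasets) {
        for (String ds : datasets) {
            if (!sql.contains(ds)) {
                return Optional.of("intent: dataset not referenced: " + ds);
            }
        }
        return Optional.empty();
    }

    // Retry loop: the generator receives the question plus any error
    // context produced by the previous failed attempt.
    static String generateWithRetry(BiFunction<String, String, String> llm,
                                    String question, Set<String> datasets,
                                    int maxRetries) {
        String errorContext = "";
        for (int i = 0; i <= maxRetries; i++) {
            String sql = llm.apply(question, errorContext);
            Optional<String> err = syntaxError(sql)
                    .or(() -> intentError(sql, datasets));
            if (err.isEmpty()) {
                return sql;
            }
            errorContext = err.get();
        }
        throw new IllegalStateException("validation failed after retries");
    }

    public static void main(String[] args) {
        // Toy generator: emits a broken query first, then a fixed one once
        // error context is supplied.
        BiFunction<String, String, String> llm = (q, err) ->
            err.isEmpty() ? "FROM Users" : "SELECT u.name FROM Users u";
        String sql = generateWithRetry(llm, "list user names",
                                       Set.of("Users"), 2);
        System.out.println(sql); // prints SELECT u.name FROM Users u
    }
}
```
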
Limitations & failure cases: Section 6 of the proposal discusses ambiguity
handling, unsupported intents (surfaced via an interactive clarification
stretch goal), and safety considerations around query execution.
Evaluation plan: The proposal includes a golden test suite (Section 10)
covering five query complexity tiers, from simple lookups to multi-dataset
aggregations and spatial queries, with both exact-match and execution-based
correctness metrics. This provides a concrete benchmark for measuring the
system's effectiveness across representative query classes.
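
The two metrics can be illustrated with a toy harness; the executor below is
a stand-in for running queries against AsterixDB, and the queries and result
sets are made up for the example.

```java
import java.util.*;
import java.util.function.*;

// Hedged sketch of the two correctness metrics in the evaluation plan:
// exact match on query text, and execution-based match (same result rows
// even when the query text differs).
public class GoldenSuite {

    // Exact match: normalized textual equality with the gold query.
    static boolean exactMatch(String generated, String gold) {
        return generated.strip().equalsIgnoreCase(gold.strip());
    }

    // Execution match: both queries return the same (unordered) result set.
    static boolean executionMatch(String generated, String gold,
                                  Function<String, Set<String>> executor) {
        return executor.apply(generated).equals(executor.apply(gold));
    }

    public static void main(String[] args) {
        // Toy executor: maps each query to a fixed result set.
        Map<String, Set<String>> results = Map.of(
            "SELECT u.name FROM Users u", Set.of("alice", "bob"),
            "SELECT name FROM Users", Set.of("alice", "bob"));
        Function<String, Set<String>> exec = results::get;

        String gold = "SELECT u.name FROM Users u";
        String generated = "SELECT name FROM Users";
        System.out.println(exactMatch(generated, gold));           // false
        System.out.println(executionMatch(generated, gold, exec)); // true
    }
}
```

A query that fails exact match can still pass execution-based match, which is
why the plan tracks both: the first is strict and cheap, the second captures
semantically equivalent rewrites.
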

The updated proposal: 
https://drive.google.com/file/d/15agxYa911Aktqcc1oeCT2UiSK5kmlDqw/view?usp=sharing

I also submitted a small bug fix to the AsterixDB codebase while exploring
the compiler internals: ASTERIXDB-3021. Implicit cross joins (FROM A, B)
silently skipped the cross-product warning that explicit joins already emit.
The fix reuses the existing JoinUtils.warnIfCrossProduct() mechanism.

PR: https://github.com/apache/asterixdb/pull/49

I would greatly appreciate any further feedback. Thank you so much.

Best regards,
Tianyang Gao

________________________________
From: Suryaa Charan Shivakumar <[email protected]>
Sent: Monday, March 30, 2026 10:51 PM
To: [email protected] <[email protected]>
Cc: Gao, Tianyang <[email protected]>
Subject: Re: Introduction: GSoC 2026 Applicant interested in NL2SQL++ project

Hello Gao,

Thank you for the effort and email. It is good to see that you have gone beyond 
the idea stage and started making concrete progress on the integration side.

Your proposed direction is interesting, and the selective schema injection 
approach sounds reasonable as a way to control prompt size and improve 
relevance. The emphasis on dataset retrieval, column pruning, and value hints 
is sensible for narrowing context before query generation.

One important point to keep in mind is that this project should not be treated 
as only a schema-to-query problem. AsterixDB supports a large number of 
built-in functions, including many specialized SQL++ and spatial functions. For 
an NL2SQL++ assistant to be genuinely useful, it should be aware not only of 
datasets and fields, but also of the available functions, what inputs they 
expect, what they semantically do, and what kind of outputs they produce. That 
functional knowledge is important if the system is expected to generate 
meaningful queries beyond simple filter/project patterns.

For the formal proposal, it would be useful to see:

  *   a clearly scoped MVP,

  *   the end-to-end architecture,

  *   how schema and function knowledge will be represented and retrieved,

  *   how generated queries will be validated or repaired/improved,

  *   and a concrete evaluation plan with example query classes and success 
criteria.

It would also help to discuss limitations and failure cases early, especially 
around ambiguity, unsupported intents, and safety of executing generated 
queries. Overall, this looks like a promising start.

Best regards,
Suryaa

On Thu, Mar 26, 2026 at 8:19 PM Gao, Tianyang via dev
<[email protected]> wrote:
Hi AsterixDB community,

My name is Tianyang Gao and I am a Master's student at Georgia Tech
interested in contributing to Apache AsterixDB as part of GSoC 2026. I would
like to apply for the NL2SQL++ project, which adds natural language query
support to AsterixDB via LangChain4j and a RAG-based schema injection
pipeline.

About me:

I have been studying the AsterixDB codebase for the past few weeks, focusing
on the query processing pipeline, the Hyracks HTTP server infrastructure, and
the metadata system. I have already made an initial contribution by
bootstrapping the asterix spidersilk module. This includes an
IApiServerRegistrant ServiceLoader extension point and a skeleton
NL2SqlServlet registered on the JSON API server (port 19002). A JIRA ticket
will be filed shortly.
My coursework in Database Systems (CS6400) at Georgia Tech gave me a solid 
foundation in SQL, schema design and metadata management. This background is 
directly applicable to the schema extraction pipeline and to understanding the 
semantic gap between natural language and structured queries that this project 
aims to bridge.

As a side project, I built a YouTube-style video streaming website in Java
(available at https://github.com/pineappleBest123/imooc-youtube). This gave
me hands-on experience with Java web backends, REST API design, and working
with large media datasets.

My proposed approach:

Rather than injecting the full schema into every LLM prompt, I plan to
implement a three-layer selective schema injection strategy inspired by CHESS
and DAIL-SQL:
1. Dataset retrieval via RAG (LangChain4j EmbeddingStoreContentRetriever)
2. Column pruning via TF-IDF keyword scoring and semantic similarity
3. Value hints by sampling actual field values from AsterixDB
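
For illustration, layer 2 (column pruning) might look like the toy sketch
below, using TF-IDF only; the column names and descriptions are invented, and
a real version would blend in embedding-based semantic similarity as
described above.

```java
import java.util.*;
import java.util.stream.*;

// Hedged sketch: keep only the k columns whose descriptions score highest
// against the question under TF-IDF. Not project code; purely illustrative.
public class ColumnPruner {

    static List<String> tokens(String s) {
        return Arrays.stream(s.toLowerCase().split("\\W+"))
                     .filter(t -> !t.isBlank())
                     .toList();
    }

    // Rank columns by the summed TF-IDF weight of question terms appearing
    // in each column's description, then keep the top k.
    static List<String> prune(Map<String, String> columnDocs,
                              String question, int k) {
        int n = columnDocs.size();
        // Document frequency of each term across column descriptions.
        Map<String, Long> df = columnDocs.values().stream()
            .flatMap(d -> tokens(d).stream().distinct())
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        List<String> q = tokens(question);
        return columnDocs.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, String> e) -> {
                    List<String> d = tokens(e.getValue());
                    double s = 0;
                    for (String t : q) {
                        long tf = d.stream().filter(t::equals).count();
                        if (tf > 0) {
                            s += tf * Math.log((double) n / df.get(t));
                        }
                    }
                    return s;
                }).reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        Map<String, String> cols = Map.of(
            "user_name", "name of the user account",
            "signup_date", "date the user signed up",
            "avatar_url", "profile picture link");
        System.out.println(prune(cols, "when did each user sign up", 1));
        // prints [signup_date]
    }
}
```
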

The system will support OpenAI and local Ollama models, configurable via
cc.conf, with multi-turn conversation context managed through LangChain4j's
TokenWindowChatMemory.

I would love to get feedback on this direction and connect with the mentors for 
this project. Is there anything specific you would like to see in the formal 
GSoC proposal?

Thank you for your time.
Best regards,
Tianyang Gao
