Dear Sergei Golubchik,

That's a good question! I would like to discuss the VIDEX architecture, which 
is informed by our industry practice at ByteDance, particularly in its cloud 
environment. You may also refer to the VIDEX arXiv paper, which presents 
further design considerations [1].

VIDEX aims to simulate the query plans of a target instance (which may hold a 
large volume of data) without access to the actual data. We categorize the 
information that the upper-layer optimizer requests from the underlying engine 
into two types:

1. **Simple statistics**: These can be obtained straightforwardly and 
efficiently from the target instance (e.g., from mysql.innodb_index_stats and 
information_schema.TABLES), and include values such as table_rows, 
AVG_ROW_LENGTH, and clustered_index_size (see the retrieval sketch after this 
list).
2. **Complex statistics**: These cannot be obtained directly and usually 
require estimation, primarily multi-column NDV (number of distinct values) 
estimation needed for records_per_key, and multi-column cardinality estimation 
needed for records_in_range. Since what-if analysis may create indexes on any 
combination of columns (up to 2^N combinations), we must employ heuristic or 
AI-boosted algorithmic models for estimation rather than enumerating and 
pre-storing all combinations.
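
To make the first category concrete, here is a minimal sketch of how such 
simple statistics can be pulled from a running target instance. The pymysql 
dependency, connection parameters, and the tpch/lineitem names are 
illustrative assumptions; the system tables and columns are standard 
MySQL/MariaDB.

```python
# Sketch only: fetch "simple" statistics from the target instance.
import pymysql

conn = pymysql.connect(host="target-host", user="videx", password="secret")
with conn.cursor() as cur:
    # Table-level statistics: table_rows, avg_row_length, data size.
    cur.execute("""
        SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH
        FROM information_schema.TABLES
        WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s
    """, ("tpch", "lineitem"))
    table_rows, avg_row_length, data_length = cur.fetchone()

    # Index-level statistics, e.g. the clustered index size (stat_name
    # 'size' on PRIMARY) and per-prefix NDVs (n_diff_pfx01, ...).
    cur.execute("""
        SELECT index_name, stat_name, stat_value
        FROM mysql.innodb_index_stats
        WHERE database_name = %s AND table_name = %s
    """, ("tpch", "lineitem"))
    for index_name, stat_name, stat_value in cur.fetchall():
        print(index_name, stat_name, stat_value)
```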

## Q1: Why are AI-boosted cardinality and NDV algorithms required?

Simple statistics are directly recorded in system tables. However, accurately 
estimating multi-column joint distributions poses significant challenges: 
rule-based or heuristic methods often produce large errors (e.g., PostgreSQL's 
native implementation). Most research in this area [2] employs query-driven or 
data-driven AI models to improve the precision of NDV and cardinality 
estimates. The advantage of AI models is that they learn data distributions 
from data or queries without actually creating indexes or collecting 
statistics, providing more generalized and accurate estimates of cardinality 
and NDV.
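
As a toy illustration of the problem (not VIDEX code): under the 
column-independence assumption that rule-based estimators typically make, a 
predicate over perfectly correlated columns is badly underestimated, and the 
gap compounds as more correlated columns are conjoined.

```python
# 100 rows where state and city are perfectly correlated.
rows = [("CA", "Los Angeles")] * 50 + [("NY", "New York")] * 50

sel_state = sum(r[0] == "CA" for r in rows) / len(rows)          # 0.5
sel_city = sum(r[1] == "Los Angeles" for r in rows) / len(rows)  # 0.5

# Independence assumption: 0.5 * 0.5 = 0.25 -> 25 estimated rows.
est_rows = sel_state * sel_city * len(rows)

# True cardinality: the two columns always co-occur -> 50 rows.
true_rows = sum(r == ("CA", "Los Angeles") for r in rows)

print(est_rows, true_rows)  # 25.0 vs 50, a 2x error even on a toy example
```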

## Q2: Why separate a RESTful statistics server from the VIDEX engine?

Although much research has been conducted, most of it is not practically 
applicable in production. For instance, the Postgres cardinality-hacking 
approach [3] requires researchers to write fixed-order cardinality results to 
a file, which the Postgres optimizer then reads; this method is not suited to 
runtime operation.
Moreover, AI researchers face substantial obstacles when integrating AI models 
into database engines: on the one hand, the engines expose complex low-level 
interfaces; on the other, AI models frequently require Python and GPU 
environments.

We advocate a separate RESTful statistics server that facilitates the seamless 
integration of various models (AI and non-AI) into the query planning of 
MariaDB/MySQL. This approach has proven genuinely practical: validated through 
extensive use in ByteDance's cloud services, VIDEX processes over 5,000 index 
recommendation tasks and tens of thousands of slow SQL queries daily. 
Furthermore, we have developed and integrated multiple NDV and cardinality 
algorithms into VIDEX, with the corresponding papers accepted by VLDB and 
SIGMOD [4][5][6].
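
For a concrete picture of the disaggregation, here is a minimal sketch of what 
such a statistics service can look like. It assumes Flask; the route name, 
payload fields, and the estimate_cardinality() hook are hypothetical 
placeholders, not the actual VIDEX API.

```python
# Sketch of a disaggregated statistics service (hypothetical API names).
from flask import Flask, request, jsonify

app = Flask(__name__)

def estimate_cardinality(table, key_ranges):
    """Pluggable model hook: heuristic, learned, CPU- or GPU-backed."""
    return 42  # a real implementation would consult the chosen estimator

@app.route("/videx/records_in_range", methods=["POST"])
def records_in_range():
    req = request.get_json()
    est = estimate_cardinality(req["table"], req["key_ranges"])
    # The engine-side VIDEX plugin consumes this value when the optimizer
    # invokes its records_in_range() handler.
    return jsonify({"estimated_records": est})

if __name__ == "__main__":
    app.run(port=5001)
```

Because the service only speaks HTTP/JSON, a model can live in whatever 
environment it needs (Python, GPUs) without the engine linking against it.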

## Q3: How to balance estimation accuracy and architectural simplicity?

In the current implementation, we forward all requests (whether simple or 
complex) to the server for simplicity. However, this involves an additional 
server module. Ideally, we believe VIDEX can strike a balance between 
estimation accuracy and architectural simplicity:

1. **Heuristic estimation mode**: All statistical data is stored in the 
database's statistics tables (e.g., table_rows, clustered_index_size, 
single-column NDVs, single-column histograms). The VIDEX plugin uses simple 
assumptions (e.g., independence between columns, uniform distribution within a 
single column) to provide estimates for records_in_range and records_per_key 
(see the sketch after this list).
2. **AI-boosted estimation mode**: To enhance accuracy, complex multi-column 
joint NDV and cardinality estimates are forwarded to a RESTful statistics 
server, which can be implemented in Python, C, or Java and deployed in CPU or 
GPU environments.
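
Here is a minimal sketch of the heuristic mode, under the stated assumptions 
(column independence, per-column uniformity); the function names and 
signatures are illustrative, not the plugin's actual interface.

```python
# Heuristic-mode sketch: estimates from single-column statistics only.

def records_per_key(table_rows: int, column_ndvs: list) -> float:
    """Rows per distinct key prefix, assuming independent columns:
    the joint NDV is the product of per-column NDVs, capped at row count."""
    joint_ndv = 1
    for ndv in column_ndvs:
        joint_ndv = min(joint_ndv * ndv, table_rows)
    return table_rows / max(joint_ndv, 1)

def records_in_range(table_rows: int, col_min: float, col_max: float,
                     lo: float, hi: float) -> float:
    """Single-column range cardinality, assuming a uniform distribution
    between the column's min and max (a histogram would refine this)."""
    if col_max <= col_min:
        return float(table_rows)
    frac = (min(hi, col_max) - max(lo, col_min)) / (col_max - col_min)
    return table_rows * max(0.0, min(1.0, frac))

# A 2-column index on a 1M-row table with per-column NDVs 1000 and 50:
print(records_per_key(1_000_000, [1000, 50]))  # -> 20.0 rows per key
```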

[1] VIDEX design paper: Rong Kang, Shuai Wang, et al. "VIDEX: A Disaggregated 
and Extensible Virtual Index for the Cloud and AI Era." 
https://arxiv.org/abs/2503.23776
[2] AI models in DB optimization: Yuxing Han et al. "Cardinality Estimation in 
DBMS: A Comprehensive Benchmark Evaluation." Proc. VLDB 2021.
[3] PG cardinality hacking: 
https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark/
[4] VIDEX NDV paper 1: Xianghong Xu, Tieying Zhang, et al. "AdaNDV: Adaptive 
Number of Distinct Value Estimation via Learning to Select and Fuse 
Estimators." VLDB 2025 (accepted).
[5] VIDEX NDV paper 2: Xianghong Xu, Xiao He, et al. "PLM4NDV: Minimizing Data 
Access for Number of Distinct Values Estimation with Pre-trained Language 
Models." SIGMOD 2025 (accepted).
[6] VIDEX cardinality paper: Peizhi Wu, Rong Kang, et al. "Data-Agnostic 
Cardinality Learning from Imperfect Workloads." VLDB 2025 (accepted).

I hope this addresses your concerns. If you have any further questions, please 
feel free to continue the discussion.

Best,
Rong,
ByteBrain Team, ByteDance