kaori-seasons commented on issue #60951:
URL: https://github.com/apache/doris/issues/60951#issuecomment-4002877928
> Thank you for such a comprehensive and well-structured proposal! This is a
high-quality design document that clearly outlines the technical path for
Snowflake dialect compatibility. We really appreciate the detailed analysis of
the function signatures and the "decomposable" implementation strategy.
>
> We agree with your design principles, especially **G-5 (not altering
existing function behavior)** and the prioritization of FE-layer solutions. The
tiered classification (Tier 0-5) is very practical.
>
> Before we proceed, we have a few questions regarding the overall strategy
and future maintainability:
>
> 1. **Compatibility Coverage Measurement**: How do you plan to measure the
final compatibility coverage? For the goal of ">90% high-frequency function
compatibility," will you be using a curated list of Snowflake functions to
validate this metric?
> 2. **Handling Hard-to-Compat Functions**: For functions that cannot be
easily handled by FE aliases or rewriting (e.g., complex `LATERAL FLATTEN`
usage or `OBJECT_CONSTRUCT` NULL-skipping), do you envision these being handled
primarily by the Dialect Converter Plugin? If so, should we consider
consolidating more logic into the SQL Converter to keep the core engine clean,
or do you see value in implementing specific BE functions for them?
> 3. **Dialect Conflict Management**: Regarding potential conflicts, if we
introduce support for other dialects in the future (e.g., Databricks SQL), how
do we plan to handle function name collisions or semantic differences (like the
`SPLIT` parameter order you mentioned)? Should the alias registration mechanism
be made dialect-aware to ensure isolation?
>
> We look forward to your thoughts on the questions above!
@morningman
Your questions are insightful and comprehensive. Please allow me to address each of them in turn:
### Question 1: How to Measure Compatibility Coverage?
**Question**:
> How do you plan to measure the final compatibility coverage? For the goal
of ">90% high-frequency function compatibility," will you be using a curated
list of Snowflake functions to validate this metric?
**Thoughts**:
A single percentage metric can be misleading—for example, "supporting 80% of
functions" might include many low-frequency niche functions. We recommend
adopting a **multi-dimensional measurement framework**:
| Measurement Dimension | Description | Data Source |
|----------------------|-------------|-------------|
| **Function-level Coverage** | Compatible functions / total functions | Snowflake official documentation |
| **High-frequency Weighted Coverage** | Weighted by usage frequency | Real customer query logs (anonymized) |
| **SQL Direct Execution Rate** | Proportion of SQL that executes without modification | Corpus of 10,000 migration SQL statements |
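As a concrete illustration of the "High-frequency Weighted Coverage" row, here is a minimal Python sketch of how such a metric could be computed: each function is weighted by its call count in (anonymized) query logs, so covering a heavily used function counts far more than covering a niche one. The function names and frequencies below are illustrative, not real log data.

```python
# Hypothetical sketch of a frequency-weighted coverage metric.
# usage_freq: {function_name: call_count from query logs}
# compatible: set of function names Doris currently supports.

def weighted_coverage(usage_freq, compatible):
    """Share of total function calls that hit a compatible function."""
    total = sum(usage_freq.values())
    covered = sum(n for fn, n in usage_freq.items() if fn in compatible)
    return covered / total if total else 0.0

# Illustrative numbers only: 2 of 4 functions are supported,
# but they account for most of the actual call volume.
usage = {"SUBSTR": 5000, "IFF": 3000, "OBJECT_CONSTRUCT": 800, "RARE_FN": 10}
supported = {"SUBSTR", "IFF"}

print(f"{weighted_coverage(usage, supported):.3f}")
```

This is why a flat "percent of functions" figure can understate (or overstate) real migration readiness.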
**Phased Goals**:
```
After Phase 1 → High-frequency Top 50 coverage ≥ 60%
After Phase 2 → High-frequency Top 50 coverage ≥ 85%
After Phase 3 → SQL direct execution rate ≥ 90%
```
**Follow-up Actions**: We are willing to maintain a compatibility matrix
document that clearly marks each Snowflake function's compatibility status (
Fully Compatible / Approximately Compatible / Not Compatible).
---
### Question 2: How to Handle Hard-to-Compat Functions?
**Question**:
> For functions that cannot be easily handled by FE aliases or rewriting
(e.g., complex LATERAL FLATTEN usage or OBJECT_CONSTRUCT NULL-skipping), do you
envision these being handled primarily by the Dialect Converter Plugin? If so,
should we consider consolidating more logic into the SQL Converter to keep the
core engine clean, or do you see value in implementing specific BE functions
for them?
**Thoughts**:
This is an excellent architectural trade-off question. Our core position is:
**Converter handles syntax-level differences, FE/BE handles runtime semantic
differences**.
**Why can't OBJECT_CONSTRUCT rely on Converter?**
Snowflake's `OBJECT_CONSTRUCT` has a special semantic: when a value is NULL, that key is skipped. This requires a **runtime decision**: because the Converter works at the SQL text level, it cannot know whether an expression (a column reference, a subquery result, etc.) evaluates to NULL.
```
-- Converter can only see this text, cannot know if column_value is NULL at
runtime
OBJECT_CONSTRUCT('key1', column_value, 'key2', 'static_value')
```
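To make the runtime nature of this concrete, here is a minimal Python sketch (not Doris code) of the NULL-skipping semantics: the decision to drop a key happens per row, after the value expression has been evaluated, which is exactly what a text-level Converter cannot reproduce.

```python
# Sketch of OBJECT_CONSTRUCT's NULL-skipping behavior, with Python's None
# standing in for SQL NULL. A (key, value) pair whose value is None is dropped.

def object_construct(*args):
    """Takes alternating key/value arguments; skips pairs whose value is None."""
    assert len(args) % 2 == 0, "expects key/value pairs"
    return {k: v for k, v in zip(args[::2], args[1::2]) if v is not None}

# Same SQL text, different results depending on the runtime value of a column:
print(object_construct("key1", 42, "key2", "static_value"))
print(object_construct("key1", None, "key2", "static_value"))
```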
**Decision Matrix**:
| Function/Syntax | Placement | Reason |
|----------------|-----------|--------|
| `OBJECT_CONSTRUCT` NULL-skipping | New BE function | Requires runtime judgment |
| `FLATTEN` TVF basic version | New BE Table Function | Outputs multiple rows and columns; Converter cannot simulate it |
| `LATERAL FLATTEN` syntax | Dialect Converter | Pure syntax-level transformation |
| `QUALIFY` clause | Dialect Converter | Rewrite to subquery |
| `SPLIT` parameter order conflict | Dialect Converter | Don't change existing behavior; swap under dialect mode |
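The `FLATTEN` row above deserves a brief illustration of why it cannot be simulated by text rewriting: a table function turns one input value into multiple output rows with multiple columns. The Python sketch below is illustrative only (Snowflake's real `FLATTEN` also emits `SEQ`, `PATH`, and `THIS` columns, which are omitted here).

```python
# Toy generator showing the row-expansion behavior of a FLATTEN-style TVF:
# one array or object value fans out into one output row per element.

def flatten(value):
    """Yield (index_or_key, element) rows for an array or object value."""
    if isinstance(value, list):
        for i, v in enumerate(value):
            yield (i, v)          # arrays: INDEX column is the position
    elif isinstance(value, dict):
        for k, v in value.items():
            yield (k, v)          # objects: KEY column instead of INDEX

for row in flatten(["a", "b", "c"]):
    print(row)
```

No string-to-string rewrite of the calling SQL can produce this one-to-many expansion in the general case, which is why it belongs in the BE as a real table function.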
**View on "Keeping the Core Engine Clean"**:
We understand the community's concerns, but **semantic correctness must come
first**. If Converter cannot guarantee correctness, new logic should be
implemented in FE/BE. This also aligns with the G-2 principle in our proposal.
---
### Question 3: How to Manage Multi-Dialect Conflicts?
**Question**:
> Regarding potential conflicts, if we introduce support for other dialects
in the future (e.g., Databricks SQL), how do we plan to handle function name
collisions or semantic differences (like the SPLIT parameter order you
mentioned)? Should the alias registration mechanism be made dialect-aware to
ensure isolation?
**Thoughts**:
This is a key question for architectural foresight. We recommend adopting a
**progressive dialect isolation strategy**:
#### Short-term Solution (Phase 1-2)
**No dialect awareness needed**. The 16 pure aliases + 7 expression-rewrite functions at this stage **do not involve conflicts**: `IFF`, `LEN`, `CHARINDEX`, etc., do not exist in Doris today, so registering them cannot collide with existing functions.
For the `SPLIT` parameter order conflict (Snowflake: `SPLIT(str, delim)` vs
Doris: `SPLIT(delim, str)`), we document the difference and suggest Snowflake
users explicitly use `split_by_string(delim, str)`.
#### Mid-term Solution (Phase 5)
Introduce **Session-level dialect mode**:
```sql
-- Set dialect mode
SET SESSION sql_dialect = 'snowflake';

-- Under this session, SPLIT automatically swaps its parameters;
-- internally this is converted to split_by_string(',', 'hello,world')
SELECT SPLIT('hello,world', ',');
```
#### Long-term Architecture
Function lookup supports **(function_name, dialect, parameter_types)**
triple-key indexing:
```
("split", DORIS, (VARCHAR, VARCHAR)) → SplitByString // delim first
("split", SNOWFLAKE, (VARCHAR, VARCHAR)) → SplitSnowflake // string first
("split", DATABRICKS, (VARCHAR, VARCHAR)) → SplitSnowflake // string first, same as Snowflake
```
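A hedged sketch of this triple-key lookup in Python follows; the `split_by_string` / `split_snowflake` names mirror the mapping above, but everything else (the registry shape, the fallback rule) is illustrative, not actual Doris FE code.

```python
# Sketch of a dialect-aware function registry keyed by (name, dialect).
# Parameter types are omitted for brevity; the real index would include them.

def split_by_string(delim, s):   # Doris order: delimiter first
    return s.split(delim)

def split_snowflake(s, delim):   # Snowflake order: string first
    return s.split(delim)

REGISTRY = {
    ("split", "doris"):      split_by_string,
    ("split", "snowflake"):  split_snowflake,
    ("split", "databricks"): split_snowflake,  # same order as Snowflake
}

def call(name, dialect, *args):
    # Fall back to the native Doris signature when a dialect has no override.
    fn = REGISTRY.get((name, dialect)) or REGISTRY[(name, "doris")]
    return fn(*args)

# Same logical call, resolved per session dialect:
print(call("split", "doris", ",", "hello,world"))
print(call("split", "snowflake", "hello,world", ","))
```

The key design point is that dialect resolution is an isolated lookup dimension: adding a new dialect adds registry entries without touching existing native bindings.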
**Outlook for Databricks SQL**:
If Databricks is supported in the future, we expect `SPLIT` to conflict as well (Databricks and Snowflake use the same parameter order, string first). This is why a more general dialect isolation mechanism will be needed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]