terrymanu commented on issue #37852:
URL:
https://github.com/apache/shardingsphere/issues/37852#issuecomment-3800335964
Understanding
- On 5.5.2 with PostgreSQL 17, SELECT DISTINCT ... FROM t24 over a sharded
table via Proxy returns duplicates (including NULL), while running directly on
a single PG node returns globally deduplicated rows. Logs show the query is
rewritten into per-shard SELECT DISTINCT ... combined with UNION ALL, so global
DISTINCT semantics are lost.
Root Cause
- In
infra/rewrite/core/src/main/java/org/apache/shardingsphere/infra/rewrite/engine/RouteSQLRewriteEngine.java,
isNeedAggregateRewrite allows aggregate rewrite when there is no
subquery/join, ORDER BY, pagination, or lock. It does not exclude DISTINCT, so
DISTINCT queries are still rewritten into DISTINCT ... UNION ALL ....
- After this rewrite, the merge layer does not re-apply a global DISTINCT,
so only per-shard distinct + global UNION ALL remains, breaking SQL semantics.
Analysis
- DISTINCT is a global set operation. Splitting it into per-shard DISTINCT
and stitching with UNION ALL inevitably reintroduces duplicates.
- ShardingDQLResultMerger (see
features/sharding/core/src/main/java/org/apache/shardingsphere/sharding/merge/dql/ShardingDQLResultMerger.java)
can handle global distinct via ProjectionsContext.isDistinctRow(), but once
the rewrite has already produced UNION ALL output, the merge stage cannot
restore global deduplication.
- The defect is therefore in the rewrite stage: DISTINCT must be a blocker
for aggregate rewrite, or the rewrite must add a compensating global
distinct/aggregation (currently absent).
Conclusion
- This is a correctness bug in 5.5.2: DISTINCT is not protected from
aggregate rewrite, so global deduplication is lost.
Design-Level Fix Suggestion
- Block aggregate rewrite when DISTINCT is present to preserve global
semantics:
```java
private boolean isNeedAggregateRewrite(final SQLStatementContext
sqlStatementContext,
final Collection<RouteUnit>
routeUnits,
final int
maxUnionSizePerDataSource) {
if (!(sqlStatementContext instanceof SelectStatementContext) || 1 ==
routeUnits.size() || 1 == maxUnionSizePerDataSource) {
return false;
}
SelectStatementContext statementContext = (SelectStatementContext)
sqlStatementContext;
if (statementContext.getProjectionsContext().isDistinctRow()) {
return false; // DISTINCT requires global semantics; skip
aggregate rewrite
}
boolean containsSubqueryJoinQuery =
statementContext.isContainsSubquery() || statementContext.isContainsJoinQuery();
boolean containsOrderByLimitClause =
!statementContext.getOrderByContext().getItems().isEmpty()
|| statementContext.getPaginationContext().isHasPagination();
boolean containsLockClause =
statementContext.getSqlStatement().getLock().isPresent();
boolean result = !containsSubqueryJoinQuery &&
!containsOrderByLimitClause && !containsLockClause;
statementContext.setNeedAggregateRewrite(result);
return result;
}
```
- Tests to add:
- In infra/rewrite/core tests, add a DISTINCT case and assert
aggregate rewrite is skipped.
- In features/sharding/core merge tests, add a multi-shard DISTINCT
case to verify global deduplication.
We warmly invite community contributors to submit a PR with this change
and the accompanying tests; I’ll be happy to review and help get it merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]