Re: Why are JOIN/correlation predicates not pushed down as IN lists across JDBC datasources?

Julian Hyde Fri, 16 May 2025 07:27:23 -0700

The short answer to your “why?” is that no one ever did the work to make this 
happen.

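The work would be a planner rule that turns the join keys from one source into
an IN-list filter on the other, plus a cost model to decide when that is
cheaper than a full scan.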
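
As a rough illustration of the decision such a rule and cost model would have to make, here is a minimal sketch in Python. It is not Calcite code: two in-memory SQLite connections stand in for the two JDBC sources, the function names are hypothetical, and the fixed `MAX_IN_LIST` threshold is an assumed stand-in for a real cost estimate. The GROUP BY from the original query is left out for brevity.

```python
import sqlite3

# Assumed cutoff for illustration only; a real cost model would compare
# estimated row counts and transfer costs instead of using a constant.
MAX_IN_LIST = 500


def make_demo_sources():
    """Create two independent in-memory databases standing in for USERSDB and TODOSDB."""
    users_db = sqlite3.connect(":memory:")
    users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

    todos_db = sqlite3.connect(":memory:")
    todos_db.execute("CREATE TABLE user_todos (user_id INTEGER, todo TEXT)")
    todos_db.executemany(
        "INSERT INTO user_todos VALUES (?, ?)",
        [(1, "buy milk"), (1, "call mom"), (3, "orphan")],
    )
    return users_db, todos_db


def federated_left_join(users_db, todos_db, max_in_list=MAX_IN_LIST):
    """LEFT JOIN users to user_todos across two connections.

    Returns (strategy, rows) where rows are (name, todo) tuples.
    """
    users = users_db.execute("SELECT id, name FROM users").fetchall()
    ids = [user_id for user_id, _ in users]

    if not ids:
        # No users, so nothing needs to be fetched from the second source.
        strategy, todos = "skip", []
    elif len(ids) <= max_in_list:
        # Few keys: ship them to the second source as a parameterized IN list.
        marks = ",".join("?" * len(ids))
        sql = f"SELECT user_id, todo FROM user_todos WHERE user_id IN ({marks})"
        strategy, todos = "in-list", todos_db.execute(sql, ids).fetchall()
    else:
        # Too many keys to inline: fall back to scanning the whole table.
        sql = "SELECT user_id, todo FROM user_todos"
        strategy, todos = "full-scan", todos_db.execute(sql).fetchall()

    # Join locally; users without todos get a single row with todo = None.
    todos_by_user = {}
    for user_id, todo in todos:
        todos_by_user.setdefault(user_id, []).append(todo)

    rows = []
    for user_id, name in users:
        for todo in todos_by_user.get(user_id, [None]):
            rows.append((name, todo))
    return strategy, rows
```

Calling `federated_left_join(*make_demo_sources())` takes the IN-list path; passing `max_in_list=1` forces the full scan and yields the same joined rows, which is the equivalence a planner rule would need to preserve.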

But please remember that there is usually no good plan for a federated join. 
There are a few algorithms (e.g. using bloom filters) that may help a little, 
but if each query ends up shipping a significant fraction of your table over 
the WAN, you’ve lost. Just replicate one of the tables. 
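
For readers unfamiliar with the bloom filter idea mentioned above, here is a minimal Python sketch, again not Calcite code. One side builds a small bit set from its join keys and sends it to the other side, which uses it to drop rows that cannot match before shipping anything back. The class, the `prefilter` helper, and the sizing defaults are all hypothetical and chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def prefilter(rows, bloom, key_index=0):
    """Keep only rows whose join key might be present on the other side."""
    return [row for row in rows if bloom.might_contain(row[key_index])]
```

The filter is compact enough to send across the network, but it only removes rows that definitely do not match. Any false positives still travel over the WAN and are discarded by the real join, which is why it helps only a little when most rows match anyway.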

Why, then, does Calcite make it possible to write federated queries? Because 
you can’t answer a question unless you can ask it. The best way to 
operationalize a federated query is to design a data replication strategy: 
replicas, other materialized views, and a process to keep them up to date. 

Julian

> On May 16, 2025, at 06:53, Gavin Ray <ray.gavi...@gmail.com> wrote:
> 
> Suppose I have two Calcite schemas from JDBC sources.
> 
> One contains `users` and the other `user_todos`.
> 
> If I run a query like:
> 
>    SELECT u.name, t.todo
>    FROM USERSDB.users u
>    LEFT JOIN TODOSDB.user_todos t ON u.id = t.user_id
>    GROUP BY u.id, u.name, t.todo
> 
> What seems like the most logical thing to do would be to run:
> 
>    SELECT u.id, u.name
>    FROM USERSDB.users u
> 
> And then, with the ids it returns:
> 
>    SELECT t.user_id, t.todo
>    FROM TODOSDB.user_todos t
>    WHERE t.user_id IN (...)
> 
> But what happens is that the entire `user_todos` is scanned:
> 
>    /**/stat2.execute("SELECT \"USER_ID\", \"TODO\"\nFROM \"USER_TODOS\"\nORDER BY \"USER_ID\" NULLS LAST");
>    2025-05-16 09:17:54.191653-04:00 jdbc[10]: Plan       : calculate cost for plan [PUBLIC.USER_TODOS]
>    2025-05-16 09:17:54.191766-04:00 jdbc[10]: Plan       :   for table filter PUBLIC.USER_TODOS
>    2025-05-16 09:17:54.192225-04:00 jdbc[10]: Table      :     potential plan item cost 10,230 index PUBLIC.USER_TODOS.tableScan
>    2025-05-16 09:17:54.192361-04:00 jdbc[10]: Table      :     potential plan item cost 12,240 index PUBLIC.PRIMARY_KEY_C
>    2025-05-16 09:17:54.192465-04:00 jdbc[10]: Plan       :   best plan item cost 10,230 index PUBLIC.USER_TODOS.tableScan
> 
> Is this intentional?
> Why not push down JOIN predicates or correlation predicates to the JDBC
> source?
