Pushdown of joins in Apache Drill against DB2 is not done correctly if querying 
tables stored in schemas:

The query:
SELECT
        c.COUNTRY_CODE 
FROM
        gs_db2.GOSALES.COUNTRY as c
INNER JOIN
        gs_db2.GOSALES.SALES_TARGET t
ON      c.COUNTRY_CODE = t.COUNTRY_CODE_RETAILER

is converted to:

SELECT 
        *
FROM 
        "GOSALES"."COUNTRY"
INNER JOIN 
        "GOSALES"."SALES_TARGET” 
ON      "COUNTRY"."COUNTRY_CODE" = "SALES_TARGET".”COUNTRY_CODE_RETAILER"

When pushed down by Drill, which is not correct since:

Any table name references must contain schema name if schemas is used and no 
alias for the table exists i.e. ”GOSALES”.”COUNTRY”,”COUNTRY_CODE” instead of 
just ”COUNTRY”.”COUNTRY_CODE"

Since I am a database guy with close to 20 years with SQL as my primary work 
tool, I can not help to provide some small hints regarding the SQL generated: 
(Slightly OT but might be worth considering)

Quite many databases does not like to have fully qualified table names for the 
same table multiple times in the query. Some optimizers can actually decide to 
bring in the table multiple times (I’ve seen this happen on TD for instance) 
since it is strictly not the same table reference, it just refers to the same 
table which means you leave the choice to the optimizer and optimizers are 
often stupid. If you want to be strictly by the book in SQL and help the DB 
optimizer to understand the query, you define an alias for the tables and 
reference the alias wherever using that instance of the table to ensure it is 
the ”right instance” of the table you refer to.

Example:
SELECT 
        *
FROM 
        "GOSALES".”COUNTRY” AS "COUNTRY"
INNER JOIN 
        "GOSALES".”SALES_TARGET” AS ”SALES_TARGET" 
ON      "COUNTRY"."COUNTRY_CODE" = "SALES_TARGET".”COUNTRY_CODE_RETAILER”

This is a valid query and there’s no risk of the optimizer believing it can opt 
to include the table N times.

Another limit by not using aliases for tables in the pushed down query, is that 
you can only use a table once and only once in a query which is not very good 
since many times you refer to the same table multiple times but with different 
constraints or filters for different perspectives to be combined.

Last:
The * reference to columns of the table, and with two or more tables is not 
really a good option since column names might be duplicated and cause strange 
errors esp if they have different types. Maybe drill is good at this, but many 
databases are not good at dealing with duplicated column names in the same 
query, and you retrieve far more data than necessary to Apache Drill. 
IMHO: A possible solution to this since metadata is available about the tables 
and the columns to be retrieved by the query, and since most SQL knowledgeable 
persons always provide aliases for columns when retrieving same column twice to 
distinguish between them, the aliases are also available to be used in the 
pushed-down query, but this of course also means that join push down possibly 
need to rewrite the SELECT part of the query as well as pushing down fields in 
order to fully support pushdown of joins without errors and that might be hard.

Regards,
Magnus

 



Reply via email to