Pushdown of joins in Apache Drill against DB2 is not done correctly if querying
tables stored in schemas:
The query:
SELECT
c.COUNTRY_CODE
FROM
gs_db2.GOSALES.COUNTRY as c
INNER JOIN
gs_db2.GOSALES.SALES_TARGET t
ON c.COUNTRY_CODE = t.COUNTRY_CODE_RETAILER
is converted to:
SELECT
*
FROM
"GOSALES"."COUNTRY"
INNER JOIN
"GOSALES"."SALES_TARGET”
ON "COUNTRY"."COUNTRY_CODE" = "SALES_TARGET".”COUNTRY_CODE_RETAILER"
When pushed down by Drill, which is not correct since:
Any table name references must contain schema name if schemas is used and no
alias for the table exists i.e. ”GOSALES”.”COUNTRY”,”COUNTRY_CODE” instead of
just ”COUNTRY”.”COUNTRY_CODE"
Since I am a database guy with close to 20 years with SQL as my primary work
tool, I can not help to provide some small hints regarding the SQL generated:
(Slightly OT but might be worth considering)
Quite many databases does not like to have fully qualified table names for the
same table multiple times in the query. Some optimizers can actually decide to
bring in the table multiple times (I’ve seen this happen on TD for instance)
since it is strictly not the same table reference, it just refers to the same
table which means you leave the choice to the optimizer and optimizers are
often stupid. If you want to be strictly by the book in SQL and help the DB
optimizer to understand the query, you define an alias for the tables and
reference the alias wherever using that instance of the table to ensure it is
the ”right instance” of the table you refer to.
Example:
SELECT
*
FROM
"GOSALES".”COUNTRY” AS "COUNTRY"
INNER JOIN
"GOSALES".”SALES_TARGET” AS ”SALES_TARGET"
ON "COUNTRY"."COUNTRY_CODE" = "SALES_TARGET".”COUNTRY_CODE_RETAILER”
This is a valid query and there’s no risk of the optimizer believing it can opt
to include the table N times.
Another limit by not using aliases for tables in the pushed down query, is that
you can only use a table once and only once in a query which is not very good
since many times you refer to the same table multiple times but with different
constraints or filters for different perspectives to be combined.
Last:
The * reference to columns of the table, and with two or more tables is not
really a good option since column names might be duplicated and cause strange
errors esp if they have different types. Maybe drill is good at this, but many
databases are not good at dealing with duplicated column names in the same
query, and you retrieve far more data than necessary to Apache Drill.
IMHO: A possible solution to this since metadata is available about the tables
and the columns to be retrieved by the query, and since most SQL knowledgeable
persons always provide aliases for columns when retrieving same column twice to
distinguish between them, the aliases are also available to be used in the
pushed-down query, but this of course also means that join push down possibly
need to rewrite the SELECT part of the query as well as pushing down fields in
order to fully support pushdown of joins without errors and that might be hard.
Regards,
Magnus