Anton Uvarov created SPARK-45198:
------------------------------------
Summary: problem using broadcast join with parquet/iceberg tables
Key: SPARK-45198
URL: https://issues.apache.org/jira/browse/SPARK-45198
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 3.4.1
Reporter: Anton Uvarov
We have 2 Parquet tables: load_test_full_warehouse.gen_document_type and
load_test_full_warehouse.generation_document_part.
Trying to make a left join of these two tables onto each other gives a strange
result. In the case where on the left side of the join we use a large table
load_test_full_warehouse.generation_document_part, the optimizer uses a
broadcast join.
However, in the case when on the left in the join we use a small reference
table, the optimizer chooses to execute the query using the merge sort.
Although it would seem that the small table on the left in a left join should
initiate a broadcast join.
An attempt to use hints and collect statistics did not yield results. The
following queries were used:
spark.sql(f"""create table iceberg_warehouse.t1 using iceberg
as SELECT /*+ BROADCAST(doc_tp) */
doc.DOCUMENT_DATE
, doc_tp.NAME as DOCUMENT_TYPE
, COUNT(*) as DOC_QTY
FROM load_test_full_warehouse.generation_document_part doc
LEFT JOIN load_test_full_warehouse.gen_document_type doc_tp ON
doc.DOCUMENT_TYPE_ID_INT = doc_tp.DOCUMENT_TYPE_ID_INT
GROUP BY doc.DOCUMENT_DATE, doc_tp.NAME""")
spark.sql(f"""create table iceberg_warehouse.t2 using iceberg
as SELECT /*+ BROADCAST(doc_tp) */
doc.DOCUMENT_DATE
, doc_tp.NAME as DOCUMENT_TYPE
, COUNT(*) as DOC_QTY
FROM load_test_full_warehouse.gen_document_type doc_tp
LEFT JOIN load_test_full_warehouse.generation_document_part doc
ON doc.DOCUMENT_TYPE_ID_INT = doc_tp.DOCUMENT_TYPE_ID_INT
GROUP BY doc.DOCUMENT_DATE, doc_tp.NAME""")
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]