Zhang Qi created SPARK-43163:
--------------------------------
Summary: An exception occurred while hive table join tidb table
Key: SPARK-43163
URL: https://issues.apache.org/jira/browse/SPARK-43163
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.3
Reporter: Zhang Qi
When executing a query of a hive partition table (big one) inner join a tidb
table(small one), the hive partition table is auto broadcasted, which leads an
error.
The query is somelike
{{select hive_table.col1,tidb_table.col2 from hive_table inner join tidb_table
on hive_table.col2=tidb_table.col3 where ...}}
== Physical Plan ==
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [... 109 more fields]
+- Generate HiveGenericUDTF#udf.json.JsonExtractValueUDTF(xxx), [, ... 101 more
fields], false, [...]
+- Project [, ... 102 more fields]
+- BroadcastHashJoin [xxx#94], [xxxx#475], Inner, BuildRight, false
:- TiKV CoprocessorRDD\{[table: xxx] TableReader, Columns: xxxx(): {
TableRangeScan: { RangeFilter: [], Range:
[([t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000],
[t\200\000\000\000\000\000\004\253_s\000\000\000\000\000\000\000\000])([t\200\000\000\000\000\000\004\253_r\000\000\000\000\000\000\000\000],
[t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000])] } },
startTs: 440854942292115639} EstimatedCount:20837
+- BroadcastExchange HashedRelationBroadcastMode(List(input[107, string,
false]),false), [plan_id=32]
+- Filter isnotnull(xxx#475)
+- Scan hive xx.xxxxxxxx [, ... 100 more fields], HiveTableRelation
[{{{}xx{}}}.{{{}xxx{}}},
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: [...,
Partition Cols: [[#520|https://github.com/pingcap/tispark/issues/520],
[#521|https://github.com/pingcap/tispark/issues/521],
[#522|https://github.com/pingcap/tispark/issues/522],
[#523|https://github.com/pingcap/tispark/pull/523]], Pruned Partitions: [(, , ,
)]], [isnotnull(), (), (xx = xx)]
Here I got some log info maybe helpful.
The {{plan.stats.sizeInBytes}} of the LogicalPlan of the hive table is too
small and the {{plan.stats.sizeInBytes}} of LogicalPlan of the tidb table is
too big.
The stats of the LogicalPlans of the two seems reversed.
*Spark and TiSpark version info*
Spark 3.2.3
TiSpark 3.1.2(with a profile of spark-3.2)
*Additional context*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]