Zhang Qi created SPARK-43163:
--------------------------------

             Summary: An exception occurred while hive table join tidb table
                 Key: SPARK-43163
                 URL: https://issues.apache.org/jira/browse/SPARK-43163
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.3
            Reporter: Zhang Qi


When executing a query in which a Hive partitioned table (the big one) is inner 
joined with a TiDB table (the small one), the Hive partitioned table is 
automatically broadcast, which leads to an error.
The query looks something like
 {{select hive_table.col1,tidb_table.col2 from hive_table inner join tidb_table 
on hive_table.col2=tidb_table.col3 where ...}}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [... 109 more fields]
   +- Generate HiveGenericUDTF#udf.json.JsonExtractValueUDTF(xxx), [, ... 101 more fields], false, [...]
      +- Project [, ... 102 more fields]
         +- BroadcastHashJoin [xxx#94], [xxxx#475], Inner, BuildRight, false
            :- TiKV CoprocessorRDD{[table: xxx] TableReader, Columns: xxxx(): { TableRangeScan: { RangeFilter: [], Range: [([t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_s\000\000\000\000\000\000\000\000])([t\200\000\000\000\000\000\004\253_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000])] } }, startTs: 440854942292115639} EstimatedCount:20837
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[107, string, false]),false), [plan_id=32]
               +- Filter isnotnull(xxx#475)
                  +- Scan hive xx.xxxxxxxx [, ... 100 more fields], HiveTableRelation [`xx`.`xxx`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: [..., Partition Cols: [#520, #521, #522, #523], Pruned Partitions: [(, , , )]], [isnotnull(), (), (xx = xx)]

Here is some log information that may be helpful.
The {{plan.stats.sizeInBytes}} of the Hive table's LogicalPlan is far too 
small, and the {{plan.stats.sizeInBytes}} of the TiDB table's LogicalPlan is 
far too large.
The statistics of the two LogicalPlans appear to be reversed.
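
To make the suspected mechanism concrete, here is a minimal sketch in plain Python of how a size-based planner could end up broadcasting the wrong side when the two estimates are swapped. It is a simplified model, not Spark's actual planner code: the function name {{choose_build_side}} and the byte counts are hypothetical, and only the 10 MB default of {{spark.sql.autoBroadcastJoinThreshold}} is taken from the Spark documentation.

```python
# Simplified model of size-based broadcast selection for an inner equi-join.
# Assumption: this only mirrors the general idea; it is not Spark's real code.

AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # documented Spark default (10 MB)


def choose_build_side(left_size_in_bytes, right_size_in_bytes,
                      threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Return 'BuildLeft', 'BuildRight', or None when neither side qualifies."""
    def can_broadcast(size):
        # A negative threshold makes this check fail for every size.
        return 0 <= size <= threshold

    left_ok = can_broadcast(left_size_in_bytes)
    right_ok = can_broadcast(right_size_in_bytes)
    if left_ok and right_ok:
        # Both sides fit: prefer the side with the smaller estimate.
        if right_size_in_bytes <= left_size_in_bytes:
            return "BuildRight"
        return "BuildLeft"
    if right_ok:
        return "BuildRight"
    if left_ok:
        return "BuildLeft"
    return None


# Hypothetical estimates, chosen only to illustrate the report:
# left  = TiDB table, really small but estimated as huge
# right = Hive table, really huge but estimated as tiny
tidb_estimated = 50 * 1024 ** 3
hive_estimated = 1 * 1024 ** 2

decision = choose_build_side(tidb_estimated, hive_estimated)
disabled = choose_build_side(tidb_estimated, hive_estimated, threshold=-1)
```

In this model the Hive side is chosen as the build side ({{BuildRight}}), matching the plan above, purely because its estimate falls under the threshold. The last call reflects the documented behaviour that setting {{spark.sql.autoBroadcastJoinThreshold}} to -1 disables automatic broadcasting, which may be one way to check whether the swapped statistics are what triggers the error.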

*Spark and TiSpark version info*
Spark 3.2.3
TiSpark 3.1.2 (built with the spark-3.2 profile)
*Additional context*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

