wangyum opened a new pull request #24003: [SPARK-19678][FOLLOW-UP][SQL] Add 
behavior change test when table statistics are incorrect
URL: https://github.com/apache/spark/pull/24003
 
 
   ## What changes were proposed in this pull request?
   
   Since Spark 2.2.0 
([SPARK-19678](https://issues.apache.org/jira/browse/SPARK-19678)), the query 
below is planned as a `sort merge join` instead of a `broadcast join` when the 
small table's statistics are incorrect:
   ```sql
   -- small external table with incorrect statistics
   CREATE EXTERNAL TABLE t1(c1 int)
   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   WITH SERDEPROPERTIES (
     'serialization.format' = '1'
   )
   STORED AS
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION 'file:///tmp/t1'
   TBLPROPERTIES (
     'rawDataSize'='-1', 'numFiles'='0', 'totalSize'='0',
     'COLUMN_STATS_ACCURATE'='false', 'numRows'='-1'
   );
   
   -- big table
   CREATE TABLE t2 (c1 int)
   LOCATION 'file:///tmp/t2'
   TBLPROPERTIES (
     'rawDataSize'='23437737', 'numFiles'='12222', 'totalSize'='333442230',
     'COLUMN_STATS_ACCURATE'='false', 'numRows'='443442223'
   );
   
   explain SELECT t1.c1 FROM t1 INNER JOIN t2 ON t1.c1 = t2.c1;
   ```
   This PR adds a test case for this behavior change.
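
   For readers hitting this plan change, the statements below are a minimal 
sketch of how the statistics and join choice for `t1` might be inspected and 
adjusted. They are not part of this PR. Whether recomputing statistics or 
enabling the file-system fallback restores the broadcast join depends on the 
Spark version and configuration, which this description does not state, so 
treat the effect as an assumption to verify with `EXPLAIN`.
   ```sql
   -- Show the statistics currently recorded for the small table
   DESCRIBE EXTENDED t1;

   -- Recompute table-level statistics so the size estimate reflects the data
   ANALYZE TABLE t1 COMPUTE STATISTICS;

   -- Alternatively, let Spark use the file system size when table
   -- statistics are unavailable
   SET spark.sql.statistics.fallBackToHdfs=true;

   -- Display the size threshold (in bytes) under which a relation may be broadcast
   SET spark.sql.autoBroadcastJoinThreshold;

   -- Re-check which join strategy the planner picks
   EXPLAIN SELECT t1.c1 FROM t1 INNER JOIN t2 ON t1.c1 = t2.c1;
   ```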
   
   ## How was this patch tested?
   
   unit tests
   
