please help why there is big difference in partitionFilter in spark2.4.2 and spark3.1.3.

zhangliyun Sun, 05 Mar 2023 18:27:06 -0800

Hi all




  i have a spark sql  , before in  spark 2.4.2 it runs correctly, when i 
upgrade to spark 3.1.3, it has some problem.

 the sql

 ```

select * from eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly where 
dt >= date_sub('${today}',30);




```

it will load the data of past 30 days of table  
eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly, here 
today='2023-03-01'




in spark2  i saw the physical plan  the partition Filter is PartitionFilters: 
[isnotnull(dt#1461), (dt#1461 >= 2023-01-31)]
 +- *(4) FileScan parquet 
eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly[disputeid#1327,statuswork#1330,opTs#1457,trailSeqno#1459,trailRba#1460,dt#1461,hr#1462]
 Batched: true, Format: Parquet, Location: 
PrunedInMemoryFileIndex[gs://pypl-bkt-prd-row-std-gds-non-edw-tables/apps/risk/eds/eds_risk/eds_r...,
 PartitionCount: 805, PartitionFilters: [isnotnull(dt#1461), (dt#1461 >= 
2023-01-31)], PushedFilters: [IsNotNull(disputeid)], ReadSchema: 
struct<disputeid:string,statuswork:string,opTs:string,trailSeqno:string,trailRba:string>







in spark3 , i saw the physical plan ,  the partitionFilter is 
[isnotnull(dt#1602), (cast(dt#1602 as date) >= 19387)]

```
(8) Scan parquet eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly
Output [7]: [disputeid#1468, statuswork#1471, opTs#1598, trailSeqno#1600, 
trailRba#1601, dt#1602, hr#1603]
Batched: true
Location: InMemoryFileIndex 
[gs://pypl-bkt-prd-row-std-gds-non-edw-tables/apps/risk/eds/eds_risk/eds_rds/cdh/prpc63cgudba_pp_index_disputecasedetails/dt=2023-01-30/hr=00,
 ... 784 entries]
PartitionFilters: [isnotnull(dt#1602), (cast(dt#1602 as date) >= 19387)]
PushedFilters: [IsNotNull(disputeid)]
ReadSchema: 
struct<disputeid:string,statuswork:string,opTs:string,trailSeqno:string,trailRba:string>


```


here i want to ask why there is big difference in partitionFitler in spark2 and 
spark3,  i guess most my spark configure is similar in spark2 and spark3 to run 
the same sql

please help why there is big difference in partitionFilter in spark2.4.2 and spark3.1.3.

Reply via email to