[
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Srini E reopened SPARK-29335:
-----------------------------
Updated the labels to Question and Stack Overflow
> Cost Based Optimizer stats are not used while evaluating query plans in Spark
> Sql
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
> Issue Type: Question
> Components: Optimizer
> Affects Versions: 2.3.0
> Environment: We tried to execute the same queries using spark-sql and the
> Thrift server via SQLWorkbench, but the stats are not used in either case.
> Reporter: Srini E
> Priority: Major
> Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage the CBO to get better plans for a few critical
> queries run through spark-sql or through the Thrift server using the JDBC driver.
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
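> Note that Spark's documented key casing differs from the settings above (a
> sketch of the same settings applied per-session; if config keys are matched
> case-sensitively in your deployment, the lowercase variants may be silently
> ignored):
> {code}
> -- Per-session alternative using the key casing from the Spark docs.
> spark-sql> SET spark.sql.cbo.enabled=true;
> spark-sql> SET spark.sql.cbo.joinReorder.enabled=true;
> {code}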
>
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid,
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid,
> absref;
> {code}
> Analyze completed successfully.
> However, describe extended does not show column-level stats data, and queries
> are not leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
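> For column-level stats specifically, plain describe extended on the table only
> prints table-level statistics; the per-column stats collected by ANALYZE are
> shown by naming the column (illustrative commands, assuming Spark 2.3 syntax):
> {code}
> -- Table-level stats (sizeInBytes / rowCount) appear in the Statistics row:
> spark-sql> DESCRIBE EXTENDED arrow.t_fperiods_sundar;
> -- Column-level stats (min, max, num_nulls, distinct_count) require naming the column:
> spark-sql> DESCRIBE EXTENDED arrow.t_fperiods_sundar eid;
> {code}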
> *When we run queries through spark-sql, we are not able to see the stats in
> use in the explain plan, and we are not sure whether the CBO is in effect.*
> *A quick response would be helpful.*
> *Explain Plan:*
> The following explain output does not reference any statistics usage.
>
> {code}
> spark-sql> explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid =
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;
>
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters:
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 =
> 2017),(ptype#4546 = A),(eid#4542 =
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<FPID:
> decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ...
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters:
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters:
> isnotnull(absref#4569),(absref#4569 =
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<IMNEM:
> string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF:
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters:
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) &&
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) &&
> ('a12.eid = 29940)) && isnull('a12.PID)))
> +- 'Join Inner
> :- 'SubqueryAlias a12
> : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
> +- 'SubqueryAlias a13
> +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) =
> cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 =
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) =
> cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527)))
> +- Join Inner
> :- SubqueryAlias a12
> : +- SubqueryAlias t_fperiods_sundar
> : +-
> Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
> parquet
> +- SubqueryAlias a13
> +- SubqueryAlias t_fdata_sundar
> +-
> Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
> parquet
>
> == Optimized Logical Plan ==
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as
> decimal(38,0))))
> :- Project [FPID#4523, EID#4542, PTYPE#4546]
> : +- Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) &&
> isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542
> = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
> : +-
> Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
> parquet
> +- Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
> +- Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) &&
> isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
> +-
> Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
> parquet
>
> == Physical Plan ==
> *(2) Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- *(2) BroadcastHashJoin [eid#4542, fpid#4523], [eid#4566, cast(fpid#4567 as
> decimal(38,0))], Inner, BuildLeft
> :- BroadcastExchange HashedRelationBroadcastMode(List(input[1,
> decimal(10,0), true], input[0, decimal(38,0), true]))
> : +- *(1) Project [FPID#4523, EID#4542, PTYPE#4546]
> : +- *(1) Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) &&
> isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542
> = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
> : +- *(1) FileScan parquet
> arrow.t_fperiods_sundar[FPID#4523,PID#4527,EID#4542,YEAR#4545,PTYPE#4546]
> Batched: true, Format: Parquet, Location:
> InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FPERIODS_S...,
> PartitionFilters: [], PushedFilters: [IsNotNull(PTYPE), IsNotNull(YEAR),
> IsNotNull(EID), EqualTo(YEAR,2017), EqualTo(PTYPE,A), EqualTo..., ReadSchema:
> struct<FPID:decimal(38,0),PID:string,EID:decimal(10,0),YEAR:int,PTYPE:string>
> +- *(2) Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
> +- *(2) Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) &&
> isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
> +- *(2) FileScan parquet
> arrow.t_fdata_sundar[IMNEM#4548,FVALUE#4552,EID#4566,FPID#4567,ABSREF#4569]
> Batched: true, Format: Parquet, Location:
> InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FDATA_SUNDAR],
> PartitionFilters: [], PushedFilters: [IsNotNull(ABSREF),
> EqualTo(ABSREF,Y2017), IsNotNull(FPID), IsNotNull(EID), EqualTo(EID,29940)],
> ReadSchema:
> struct<IMNEM:string,FVALUE:string,EID:decimal(10,0),FPID:decimal(10,0),ABSREF:string>
> Time taken: 0.35 seconds, Fetched 1 row(s)
> {code}
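> (For reference: on newer Spark releases, EXPLAIN COST prints the optimized
> logical plan annotated with the statistics the optimizer sees, which makes it
> easy to confirm whether CBO stats are attached; the simplified query below is
> illustrative only, not from the original report.)
> {code}
> -- Newer Spark versions; plan nodes are annotated with
> -- Statistics(sizeInBytes=..., rowCount=...) when stats are available.
> spark-sql> EXPLAIN COST SELECT * FROM arrow.t_fdata_sundar WHERE absref = 'Y2017';
> {code}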
--
This message was sent by Atlassian Jira
(v8.3.4#803005)