[
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Srini E reopened SPARK-29335:
-----------------------------
Updated the labels to Question and Stack Overflow
> Cost Based Optimizer stats are not used while evaluating query plans in Spark
> Sql
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
> Issue Type: Question
> Components: Optimizer
> Affects Versions: 2.3.0
> Environment: We tried to execute the same queries using spark-sql and the
> Thrift server via SQLWorkbench, but the stats are not used in either case.
> Reporter: Srini E
> Priority: Major
> Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage the CBO to get better plans for a few critical
> queries run through spark-sql or through the Thrift server using the JDBC driver.
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
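> Note that Spark's documented key casing differs from the settings above (a
> sketch of the same settings applied per-session; if config keys are matched
> case-sensitively in your deployment, the lowercase variants may be silently
> ignored):
> {code}
> -- Per-session alternative using the key casing from the Spark docs.
> spark-sql> SET spark.sql.cbo.enabled=true;
> spark-sql> SET spark.sql.cbo.joinReorder.enabled=true;
> {code}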
>
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid,
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid,
> absref;
> {code}
> Analyze completed successfully.
> However, describe extended does not show column-level stats data, and queries
> are not leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
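> For column-level stats specifically, plain describe extended on the table only
> prints table-level statistics; the per-column stats collected by ANALYZE are
> shown by naming the column (illustrative commands, assuming Spark 2.3 syntax):
> {code}
> -- Table-level stats (sizeInBytes / rowCount) appear in the Statistics row:
> spark-sql> DESCRIBE EXTENDED arrow.t_fperiods_sundar;
> -- Column-level stats (min, max, num_nulls, distinct_count) require naming the column:
> spark-sql> DESCRIBE EXTENDED arrow.t_fperiods_sundar eid;
> {code}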
> *When we run queries through spark-sql, we are not able to see the stats in
> use in the explain plan, and we are not sure whether the CBO is in effect.*
> *A quick response would be helpful.*
> *Explain Plan:*
> The following explain output does not reference any statistics usage.
>
> {code}
> spark-sql> explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid =
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;
>
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters:
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 =
> 2017),(ptype#4546 = A),(eid#4542 =
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<FPID:
> decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ...
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters:
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters:
> isnotnull(absref#4569),(absref#4569 =
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<IMNEM:
> string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF:
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters:
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) &&
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) &&
> ('a12.eid = 29940)) && isnull('a12.PID)))
> +- 'Join Inner
> :- 'SubqueryAlias a12
> : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
> +- 'SubqueryAlias a13
> +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) =
> cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 =
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) =
> cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527)))
> +- Join Inner
> :- SubqueryAlias a12
> : +- SubqueryAlias t_fperiods_sundar
> : +-
> Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
> parquet
> +- SubqueryAlias a13
> +- SubqueryAlias t_fdata_sundar
> +-
> Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
> parquet
>
> == Optimized Logical Plan ==
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as
> decimal(38,0))))
> :- Project [FPID#4523, EID#4542, PTYPE#4546]
> : +- Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) &&
> isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542
> = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
> : +-
> Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
> parquet
> +- Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
> +- Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) &&
> isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
> +-
> Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
> parquet
>
> == Physical Plan ==
> *(2) Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- *(2) BroadcastHashJoin [eid#4542, fpid#4523], [eid#4566, cast(fpid#4567 as
> decimal(38,0))], Inner, BuildLeft
> :- BroadcastExchange HashedRelationBroadcastMode(List(input[1,
> decimal(10,0), true], input[0, decimal(38,0), true]))
> : +- *(1) Project [FPID#4523, EID#4542, PTYPE#4546]
> : +- *(1) Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) &&
> isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542
> = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
> : +- *(1) FileScan parquet
> arrow.t_fperiods_sundar[FPID#4523,PID#4527,EID#4542,YEAR#4545,PTYPE#4546]
> Batched: true, Format: Parquet, Location:
> InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FPERIODS_S...,
> PartitionFilters: [], PushedFilters: [IsNotNull(PTYPE), IsNotNull(YEAR),
> IsNotNull(EID), EqualTo(YEAR,2017), EqualTo(PTYPE,A), EqualTo..., ReadSchema:
> struct<FPID:decimal(38,0),PID:string,EID:decimal(10,0),YEAR:int,PTYPE:string>
> +- *(2) Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
> +- *(2) Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) &&
> isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
> +- *(2) FileScan parquet
> arrow.t_fdata_sundar[IMNEM#4548,FVALUE#4552,EID#4566,FPID#4567,ABSREF#4569]
> Batched: true, Format: Parquet, Location:
> InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FDATA_SUNDAR],
> PartitionFilters: [], PushedFilters: [IsNotNull(ABSREF),
> EqualTo(ABSREF,Y2017), IsNotNull(FPID), IsNotNull(EID), EqualTo(EID,29940)],
> ReadSchema:
> struct<IMNEM:string,FVALUE:string,EID:decimal(10,0),FPID:decimal(10,0),ABSREF:string>
> Time taken: 0.35 seconds, Fetched 1 row(s)
> {code}
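> (For reference: on newer Spark releases, EXPLAIN COST prints the optimized
> logical plan annotated with the statistics the optimizer sees, which makes it
> easy to confirm whether CBO stats are attached; the simplified query below is
> illustrative only, not from the original report.)
> {code}
> -- Newer Spark versions; plan nodes are annotated with
> -- Statistics(sizeInBytes=..., rowCount=...) when stats are available.
> spark-sql> EXPLAIN COST SELECT * FROM arrow.t_fdata_sundar WHERE absref = 'Y2017';
> {code}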
--
This message was sent by Atlassian Jira
(v8.3.4#803005)