[
https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Ash updated SPARK-19213:
-------------------------------
Summary: FileSourceScanExec uses SparkSession from HadoopFsRelation
creation time instead of the active session at execution time (was:
FileSourceScanExec usese sparksession from hadoopfsrelation creation time
instead of the one active at time of execution)
> FileSourceScanExec uses SparkSession from HadoopFsRelation creation time
> instead of the active session at execution time
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-19213
> URL: https://issues.apache.org/jira/browse/SPARK-19213
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Robert Kruszewski
>
> If you look at
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260
> you'll notice that the SparkSession used for execution is the one that was
> captured from the logical plan, whereas in other places you have
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154
> and SparkPlan captures the active session upon execution in
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52
> From my understanding of the code, it looks like we should be using the
> SparkSession that is currently active, and hence take the one from
> SparkPlan. However, if you want to share Datasets across SparkSessions,
> that is not enough, since as soon as a Dataset is executed its
> QueryExecution will have captured the SparkSession at that point. If we
> want to share Datasets across users, configurations must not be fixed upon
> first execution. I consider the first part (using the SparkSession from the
> logical plan) a bug, while the second (using the SparkSession active at
> runtime) is an enhancement that makes sharing across sessions easier.
> For example:
> {code}
> val df = spark.read.parquet(...)
> df.count()
> val newSession = spark.newSession()
> SparkSession.setActiveSession(newSession)
> // <change parameters in newSession> (the simplest one to try is
> // disabling vectorized reads)
> // The logical plan still holds a reference to the original SparkSession,
> // so the changes don't take effect:
> val df2 = Dataset.ofRows(newSession, df.logicalPlan)
> {code}
> I suggest that it shouldn't be necessary to create a new Dataset for changes
> to take effect. For most plans, calling Dataset.ofRows works, but this is
> not the case for HadoopFsRelation.
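> The difference between the two lookups can be sketched with a simplified
> model. This is not Spark code: Session, Relation, CapturingScan and
> ActiveScan are hypothetical stand-ins for SparkSession, HadoopFsRelation
> and the two ways a scan could resolve its session, and the "vectorized"
> key is made up for illustration.
> {code}
> import scala.collection.mutable
>
> // Hypothetical stand-in for SparkSession: a name plus mutable settings.
> final class Session(val name: String) {
>   val conf: mutable.Map[String, String] =
>     mutable.Map("vectorized" -> "true")
> }
>
> object Session {
>   // Models the thread-local "active session" lookup.
>   private val active = new ThreadLocal[Session]
>   def setActive(s: Session): Unit = active.set(s)
>   def getActive: Session = active.get
> }
>
> // Hypothetical stand-in for a relation that remembers its creator.
> final case class Relation(createdIn: Session)
>
> // Reads config from the session captured at relation creation time.
> final class CapturingScan(relation: Relation) {
>   def vectorized: Boolean =
>     relation.createdIn.conf("vectorized").toBoolean
> }
>
> // Reads config from whichever session is active when it is called.
> final class ActiveScan(relation: Relation) {
>   def vectorized: Boolean =
>     Session.getActive.conf("vectorized").toBoolean
> }
>
> object Demo {
>   // Returns (capturing result, active result) after switching sessions.
>   def run(): (Boolean, Boolean) = {
>     val original = new Session("original")
>     Session.setActive(original)
>     val relation = Relation(original)
>
>     val second = new Session("second")
>     second.conf("vectorized") = "false"
>     Session.setActive(second)
>
>     (new CapturingScan(relation).vectorized,
>       new ActiveScan(relation).vectorized)
>   }
>
>   def main(args: Array[String]): Unit = {
>     val (captured, active) = run()
>     println(s"captured session: vectorized = $captured") // true
>     println(s"active session: vectorized = $active") // false
>   }
> }
> {code}
> In this model only ActiveScan sees the setting changed in the second
> session, which is the behavior suggested here for FileSourceScanExec.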