[ 
https://issues.apache.org/jira/browse/SPARK-31450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233054#comment-17233054
 ] 

Navin Viswanath commented on SPARK-31450:
-----------------------------------------

[~hvanhovell]  [~dongjoon] I was in the process of migrating some code from 
Spark 2.4 to Spark 3 and noticed that this required a change in our code. We 
use the following process to go from a Thrift type T to InternalRow(reading 
thrift files on HDFS into a Dataframe):
 # We construct a Spark schema by inspecting the thrift metadata.
 # We convert a thrift object to a GenericRow using the thrift metadata to read 
columns.
 # We then construct an ExpressionEncoder[Row] and use it to create an 
InternalRow as follows:
{code:java}
val schema: StructType = ... // infer thrift schema
val encoder: ExpressionEncoder[Row] = RowEncoder(schema)
val genericRow: GenericRow = toGenericRow(thriftObject, schema)
val internalRow: InternalRow = encoder.toRow(genericRow)
{code}

The above steps are used to implement
{code:java}
protected def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
{code}
in trait org.apache.spark.sql.execution.datasources.FileFormat where we need an 
Iterator[InternalRow].

With the change in this ticket, I would have to replace 
{code:java}
val internalRow: InternalRow = encoder.toRow(genericRow)  
{code}
with
{code:java}
val serializer = encoder.createSerializer()
val internalRow: InternalRow = serializer(genericRow){code}
Since this is marked as an internal API in the PR, I was wondering if there is 
a way to implement this so that it is compatible with both Spark 2.4 and Spark 
3. 

My goal is to not require a code change if possible. It seems to me that since 
I know the schema of the thrift type it should be possible to construct an 
InternalRow, but I don't see a way to do this in the code base.

> Make ExpressionEncoder thread safe
> ----------------------------------
>
>                 Key: SPARK-31450
>                 URL: https://issues.apache.org/jira/browse/SPARK-31450
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Herman van Hövell
>            Assignee: Herman van Hövell
>            Priority: Major
>             Fix For: 3.0.0
>
>
> ExpressionEncoder is currently not thread-safe because it contains stateful 
> objects that are required for converting objects to internal rows and vise 
> versa. We have been working around this by (excessively) cloning 
> ExpressionEncoders which is not free. I propose that we move the stateful 
> bits of the expression encoder into two helper classes that will take care of 
> the conversions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to