[ 
https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367036#comment-17367036
 ] 

Steven Aerts commented on SPARK-35744:
--------------------------------------

Hi [~xkrogen],

We took a different path, one that more closely follows the flow of 
{{org.apache.spark.sql.catalyst.JavaTypeInference}}.
You create an {{ExpressionEncoder}} by calling 
{{AvroSpecificRecordEncoder.from(classOf[MySpecificRecord])}}, which takes 
the schema of {{MySpecificRecord}} and, based on it, generates the 
expressions/code to serialize and deserialize to and from the generated classes.

The resulting {{StructType}} for the class matches the one you would expect, 
because the encoder internally uses {{SchemaConverters.toSqlType(schema)}}.  
This means it is compatible with all other Avro handling within Spark.
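
To illustrate the schema mapping mentioned above, here is a minimal sketch of how {{SchemaConverters.toSqlType}} from the spark-avro module turns an Avro record schema into a {{StructType}}. The record schema is a hypothetical stand-in built with Avro's {{SchemaBuilder}}; for a real generated class you would pass its own schema instead. The object name {{AvroToSqlTypeSketch}} and the field names are assumptions made for illustration, and this is not the encoder code itself.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

object AvroToSqlTypeSketch {
  // Hypothetical schema standing in for the schema of a generated
  // specific record class such as MySpecificRecord.
  val avroSchema: Schema = SchemaBuilder
    .record("MySpecificRecord")
    .namespace("example")
    .fields()
    .requiredString("name")
    .requiredInt("age")
    .endRecord()

  // toSqlType returns a SchemaType(dataType, nullable); for an Avro
  // record the dataType is a StructType with one field per Avro field.
  val sqlType: StructType =
    SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

  def main(args: Array[String]): Unit =
    println(sqlType.treeString)
}
```

Because the same conversion is used by the other Avro functions in Spark, a Dataset built with such an encoder should line up with data read through the regular Avro data source.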

The code is rather complete, performant, and standalone.  It supports (almost) 
all Avro constructs.  The test set around it, however, is more entangled, as it 
uses internal classes.

Feel free to contact me at https://github.com/steven-aerts.

> Performance degradation in avro SpecificRecordBuilders
> ------------------------------------------------------
>
>                 Key: SPARK-35744
>                 URL: https://issues.apache.org/jira/browse/SPARK-35744
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Steven Aerts
>            Priority: Minor
>
> Creating this bug to let you know that when we tested Spark 3.2.0 we saw 
> a significant performance degradation where our code was handling Avro 
> Specific Record objects.  This slowed down some of our jobs by a factor of 4.
> Spark 3.2.0 upgrades the Avro version from 1.8.2 to 1.10.2.
> The degradation was caused by a change introduced in Avro 1.9.0.  This change 
> degrades performance when creating Avro specific records in certain 
> classloader topologies, like the ones used in Spark.
> We notified the Avro project and [proposed|https://github.com/apache/avro/pull/1253] 
> a simple fix upstream.  (The links contain more details.)
> It is unclear to us how many other projects are using Avro specific records 
> in a Spark context and will be impacted by this degradation.
> Feel free to close this issue if you think it is too much of a 
> corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

