viirya opened a new pull request #24805: [SPARK-27798][SQL] from_avro shouldn't 
produce same value when converted to local relation
URL: https://github.com/apache/spark/pull/24805
 
 
   ## What changes were proposed in this pull request?
   
   When `from_avro` is used to deserialize Avro data to Catalyst StructType 
format and `ConvertToLocalRelation` is applied, every output row holds only 
the last value (earlier values are overwritten).
   
   The cause is that `AvroDeserializer` reuses its output row for StructType. 
Normally, this is fine in Spark SQL. But `ConvertToLocalRelation` just uses 
`InterpretedProjection` to project local rows. `InterpretedProjection` creates 
a new row for each output, but each new row includes the same nested row 
object from `AvroDeserializer`. As a result, the converted local relation 
contains only the last value.
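   
   The aliasing described above can be reproduced outside Spark. The 
following is a minimal toy model in plain Python, not Spark code: 
`ReusingDeserializer` and `project` are hypothetical stand-ins for 
`AvroDeserializer` and `InterpretedProjection`, and only mimic the 
"reuse one nested row, allocate a new outer row" behavior.

```python
# Toy model of the aliasing problem; these names are illustrative
# stand-ins, not Spark's actual classes.

class ReusingDeserializer:
    """Mimics a deserializer that reuses one mutable output row."""

    def __init__(self):
        self._row = [None]  # single buffer shared by every call

    def deserialize(self, value):
        self._row[0] = value
        return self._row  # the same object is returned every time


def project(nested_row):
    """Mimics a projection that allocates a new outer row per input."""
    return (nested_row,)  # new outer tuple, but the nested row is shared


deserializer = ReusingDeserializer()
local_rows = [project(deserializer.deserialize(v)) for v in (1, 2, 3)]

# Every outer row points at the same nested row, so only the last
# deserialized value is visible once all rows have been collected.
values = [row[0][0] for row in local_rows]
print(values)  # [3, 3, 3]
```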
   
   I think there are two possible options:
   
   1. Make `AvroDeserializer` output new row for StructType.
   2. Use `InterpretedMutableProjection` in `ConvertToLocalRelation` and call 
`copy()` on output rows.
   
   Option 2 is chosen because `ConvertToLocalRelation` previously also created 
new rows, so `InterpretedMutableProjection` + `copy()` shouldn't bring too 
much performance penalty. `ConvertToLocalRelation` is also arguably less 
critical than `AvroDeserializer`.
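   
   A rough sketch of the idea behind option 2, again as a plain Python toy 
model rather than Spark code: the projection writes into one reusable 
output buffer and each result is copied before it is kept. 
`project_with_copy` is a hypothetical name, and the sketch assumes the copy 
also duplicates nested rows, which `copy.deepcopy` stands in for here.

```python
import copy

# Toy model of "mutable projection + copy()"; names are illustrative only.

class ReusingDeserializer:
    """Mimics a deserializer that reuses one mutable output row."""

    def __init__(self):
        self._row = [None]

    def deserialize(self, value):
        self._row[0] = value
        return self._row


def project_with_copy(buffer, nested_row):
    """Write into a shared output buffer, then return a deep copy of it."""
    buffer[0] = nested_row
    # Assumption: the copy duplicates the nested row too, so later reuse
    # of the deserializer's row cannot change rows already collected.
    return copy.deepcopy(buffer)


deserializer = ReusingDeserializer()
output_buffer = [None]  # reused for every projected row
local_rows = [
    project_with_copy(output_buffer, deserializer.deserialize(v))
    for v in (1, 2, 3)
]

values = [row[0][0] for row in local_rows]
print(values)  # [1, 2, 3]
```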
   
   ## How was this patch tested?
   
   Added test.
   
