xkrogen commented on a change in pull request #31490:
URL: https://github.com/apache/spark/pull/31490#discussion_r658895264



##########
File path: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
##########
@@ -335,36 +336,32 @@ private[sql] class AvroDeserializer(
       avroPath: Seq[String],
       catalystPath: Seq[String],
       applyFilters: Int => Boolean): (CatalystDataUpdater, GenericRecord) => Boolean = {
-    val validFieldIndexes = ArrayBuffer.empty[Int]
-    val fieldWriters = ArrayBuffer.empty[(CatalystDataUpdater, Any) => Unit]
-
-    val avroSchemaHelper = new AvroUtils.AvroSchemaHelper(avroType, avroPath)
-    val length = catalystType.length
-    var i = 0
-    while (i < length) {
-      val catalystField = catalystType.fields(i)
-      avroSchemaHelper.getFieldByName(catalystField.name) match {
-        case Some(avroField) =>
-          validFieldIndexes += avroField.pos()
 
+    val avroSchemaHelper =
+      new AvroUtils.AvroSchemaHelper(avroType, catalystType, avroPath, positionalFieldMatch)
+
+    avroSchemaHelper.getCatalystFieldsWithoutMatch.filterNot(_.nullable) match {

Review comment:
       On the `AvroSerializer` side, we could just check the schema lengths, but then when there is a mismatch you don't get a very helpful error message: "Expected 7 fields but got 6" is not nearly as helpful as "Avro schema is missing field 'foo'" (at least IMO). I think the extra complexity here is worth it for the more useful error messages, and that argument is strengthened by the `AvroDeserializer` discussion below.
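
   To make the trade-off concrete, here is a minimal standalone sketch (not the PR's code; `SimpleField`, `lengthCheckError`, and `nameMatchError` are made-up names) of why name-based matching produces more actionable errors than a bare length comparison:

```scala
// Sketch only: `SimpleField`, `lengthCheckError` and `nameMatchError` are
// illustrative stand-ins, not code from the Spark Avro connector.
object FieldMatchErrorSketch {
  final case class SimpleField(name: String)

  // What a pure length check can say about a mismatch: only a count.
  def lengthCheckError(catalyst: Seq[SimpleField], avro: Seq[SimpleField]): Option[String] =
    if (catalyst.length != avro.length) {
      Some(s"Expected ${catalyst.length} fields but got ${avro.length}")
    } else None

  // What name-based matching can say: it names the missing field.
  def nameMatchError(catalyst: Seq[SimpleField], avro: Seq[SimpleField]): Option[String] = {
    val avroNames = avro.map(_.name).toSet
    catalyst.find(f => !avroNames.contains(f.name))
      .map(f => s"Avro schema is missing field '${f.name}'")
  }

  def main(args: Array[String]): Unit = {
    val catalyst = Seq("id", "name", "foo").map(SimpleField(_))
    val avro = Seq("id", "name").map(SimpleField(_))
    println(lengthCheckError(catalyst, avro)) // Some(Expected 3 fields but got 2)
    println(nameMatchError(catalyst, avro))   // Some(Avro schema is missing field 'foo')
  }
}
```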
   
   On the deserializer side, checking the length is not sufficient. This is because we actually allow Avro fields which don't have a matching Catalyst field (note that we don't bother calling `avroSchemaHelper.getAvroFieldsWithoutMatch` in the deserializer), treating this as a projection. Similarly, we allow Catalyst fields which don't have a matching Avro field, as long as the Catalyst field is nullable, hence the `filterNot(_.nullable)` call. Neither of these would be possible with a plain length check. Note that even before this PR, the schema-length check was only performed in the serializer, not the deserializer.
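
   As a rough illustration of that asymmetry (again just a sketch with made-up names like `CatalystFieldStub`, not the actual `AvroDeserializer` code): extra Avro fields are silently projected away, while unmatched Catalyst fields only pass if they are nullable:

```scala
// Sketch only: `CatalystFieldStub` and `checkUnmatchedCatalystFields` are
// illustrative stand-ins, not the actual AvroDeserializer implementation.
object DeserializerMatchingSketch {
  final case class CatalystFieldStub(name: String, nullable: Boolean)

  def checkUnmatchedCatalystFields(
      catalystFields: Seq[CatalystFieldStub],
      avroFieldNames: Set[String]): Unit = {
    // Catalyst fields with no Avro counterpart (analogous to getCatalystFieldsWithoutMatch).
    val unmatched = catalystFields.filterNot(f => avroFieldNames.contains(f.name))
    // Nullable ones are tolerated (they simply stay null); non-nullable ones are an error.
    unmatched.filterNot(_.nullable) match {
      case Seq() => // OK: every non-nullable Catalyst field has a match.
      case missing => throw new IllegalArgumentException(
        s"Cannot find non-nullable Catalyst field(s) ${missing.map(_.name).mkString(", ")} " +
          "in the Avro schema")
    }
    // Deliberately no check in the other direction: Avro fields without a Catalyst
    // match are allowed and simply never read (projection).
  }

  def main(args: Array[String]): Unit = {
    val catalyst = Seq(
      CatalystFieldStub("id", nullable = false),
      CatalystFieldStub("note", nullable = true))
    // "extra" exists only on the Avro side: fine, it is just projected away.
    checkUnmatchedCatalystFields(catalyst, Set("id", "extra"))
    // "id" is non-nullable and has no Avro match: this one throws.
    checkUnmatchedCatalystFields(catalyst, Set("extra"))
  }
}
```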
   
   Given that it's actually necessary for us to implement this logic on the 
deserializer side, I think it makes sense to also leverage it on the serializer 
side to provide better error messages.






