Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21847#discussion_r205648911
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -87,17 +88,33 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
           case d: DecimalType =>
             (getter, ordinal) => getter.getDecimal(ordinal, d.precision, d.scale).toString
           case StringType =>
    -        (getter, ordinal) => new Utf8(getter.getUTF8String(ordinal).getBytes)
    +        (getter, ordinal) =>
    +          if (avroType.getType == Type.ENUM) {
    +            new GenericData.EnumSymbol(avroType, getter.getUTF8String(ordinal).toString)
    +          } else {
    +            new Utf8(getter.getUTF8String(ordinal).getBytes)
    +          }
           case BinaryType =>
    -        (getter, ordinal) => ByteBuffer.wrap(getter.getBinary(ordinal))
    +        (getter, ordinal) =>
    +          val data = getter.getBinary(ordinal)
    +          if (avroType.getType == Type.FIXED) {
    +            // Handles fixed-type fields in output schema.  Test case is included in test.avro
    +            // as it includes several fixed fields that would fail if we specify schema
    +            // on-write without this condition
    +            val fixed = new GenericData.Fixed(avroType)
    +            fixed.bytes(data)
    +            fixed
    +          } else {
    +            ByteBuffer.wrap(data)
    +          }
    --- End diff --
    
    This might be slow. On the executors, the whole `if-else` sits inside the
    returned lambda, so it is evaluated again and again, once for every row
    that is serialized, even though `avroType` never changes. We could resolve
    the specialized converter earlier, in the driver, with
    ```scala
    import org.apache.avro.generic.GenericData.{Fixed, EnumSymbol}
    ...
          case StringType =>
            if (avroType.getType == Type.ENUM) {
              (getter, ordinal) => new EnumSymbol(avroType, getter.getUTF8String(ordinal).toString)
            } else {
              (getter, ordinal) => new Utf8(getter.getUTF8String(ordinal).getBytes)
            }
          case BinaryType =>
            if (avroType.getType == Type.FIXED) {
              (getter, ordinal) => new Fixed(avroType, getter.getBinary(ordinal))
            } else {
              (getter, ordinal) => ByteBuffer.wrap(getter.getBinary(ordinal))
            }
    ```
    so that the returned lambda expression no longer contains any check on the
    `FIXED` or `ENUM` types; the branch is taken once, when the converter is built.
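
    To make the difference concrete, here is a minimal, dependency-free sketch
    of the two shapes. It does not use Spark or Avro: `AvroKind`,
    `EnumSymbolLike`, `ConverterSketch`, and the `typeChecks` counter are
    hypothetical stand-ins, introduced only to count how often the type check
    runs.
    ```scala
    // Hypothetical stand-ins for the Avro schema type and enum symbol (not Avro APIs).
    sealed trait AvroKind
    case object EnumKind extends AvroKind
    case object StringKind extends AvroKind
    final case class EnumSymbolLike(name: String)

    object ConverterSketch {
      // A converter turns one field value into its serialized form.
      type Converter = String => Any

      // Counts how many times the schema type is inspected (illustration only).
      var typeChecks: Int = 0

      private def isEnum(kind: AvroKind): Boolean = {
        typeChecks += 1
        kind == EnumKind
      }

      // Shape in the diff: the check lives inside the lambda, so it runs per value.
      def perRow(kind: AvroKind): Converter =
        value => if (isEnum(kind)) EnumSymbolLike(value) else value.getBytes("UTF-8")

      // Suggested shape: the check runs once and selects a branch-free lambda.
      def hoisted(kind: AvroKind): Converter =
        if (isEnum(kind)) value => EnumSymbolLike(value)
        else value => value.getBytes("UTF-8")
    }
    ```
    Under these assumptions, building a converter with `perRow` and applying
    it to N values inspects the type N times, while `hoisted` inspects it once
    at construction, and both return the same results.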


