[ 
https://issues.apache.org/jira/browse/AVRO-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360955#comment-15360955
 ] 

Yibing Shi commented on AVRO-1811:
----------------------------------

Hi [~rdblue], this problem is not trivial to fix. It happens when deep copying 
an object that uses {{Utf8}} as its string representation into a specific record 
that uses {{java.lang.String}} instead. The issue can be reduced to the 
following unit test:
{code}
  @Test
  public void testStringDeepCopy() throws IOException {
    FooBarSpecificRecord specificRecord = FooBarSpecificRecord.newBuilder()
        .setId(1)
        .setName("test_record_specific")
        .setNicknames(new ArrayList<String>())
        .setRelatedids(new ArrayList<Integer>())
        .build();

    GenericRecordBuilder builder =
        new GenericRecordBuilder(specificRecord.getSchema());
    GenericRecord genericRecord = builder
        .set("id", 1)
        .set("name", new Utf8("test_record_specific"))
        .set("nicknames", new ArrayList<String>())
        .set("relatedids", new ArrayList<Integer>())
        .build();

    FooBarSpecificRecord copiedFromGeneric = (FooBarSpecificRecord)
        SpecificData.get().deepCopy(
            FooBarSpecificRecord.getClassSchema(), genericRecord);

    assertEquals(
        "Should get an equal record by deep copying the generic record",
        specificRecord, copiedFromGeneric);
  }
{code}

This unit test fails on the master branch with the following exception:
{noformat}
java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String
        at org.apache.avro.FooBarSpecificRecord.put(FooBarSpecificRecord.java:70)
        at org.apache.avro.generic.GenericData.setField(GenericData.java:659)
        at org.apache.avro.generic.GenericData.setField(GenericData.java:676)
        at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:1081)
        at org.apache.avro.generic.TestGenericData.testStringDeepCopy(TestGenericData.java:450)
{noformat}

Looking at [the code|https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1079-L1081]:
{code}
          Object newValue = deepCopy(f.schema(),
                                     getField(value, name, pos, oldState));
          setField(newRecord, name, pos, newValue, newState);
{code}
The variable {{newValue}} is a {{Utf8}} object because the generic record 
stores field "name" in that form, while {{setField}} ends up calling {{put}} on 
{{FooBarSpecificRecord}}, which expects a {{java.lang.String}} for that field. 
This causes the exception above.

The {{deepCopy}} method is declared with the following signature:
{code}
  public <T> T deepCopy(Schema schema, T value) {
{code}
which complicates the problem. By its signature it can only return an object of 
the same type as the input parameter {{value}}. That is, if we pass in a 
{{Utf8}} object as {{value}}, we cannot return a {{java.lang.String}} object. 

The only way I can find to solve this problem is to do the conversion 
before calling {{setField}} when deep copying records. That is, change the 
code above to something like the following:
{code}
          Object newValue = deepCopy(f.schema(),
                                     getField(value, name, pos, oldState));
          // Only touch string-like values: null and non-string values (the
          // property may also appear on non-string schemas) are left as-is.
          if (newValue instanceof CharSequence) {
            if (STRING_TYPE_STRING.equals(f.schema().getProp(STRING_PROP))) {
              newValue = newValue.toString();
            } else if (newValue instanceof java.lang.String) {
              newValue = new Utf8((String) newValue);
            }
          }
          setField(newRecord, name, pos, newValue, newState);
{code}
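
To make the idea easier to try outside of Avro, here is a minimal self-contained sketch of the same conversion. Everything in it is hypothetical: {{Utf8Like}} is only a stand-in for {{org.apache.avro.util.Utf8}}, {{putName}} imitates a generated {{put}} that casts without converting, and the boolean flag replaces the {{STRING_PROP}} lookup on the field schema. It is not the patch itself.

```java
/** Hypothetical stand-in for a non-String CharSequence such as Avro's Utf8. */
final class Utf8Like implements CharSequence {
  private final String text;

  Utf8Like(String text) { this.text = text; }

  @Override public int length() { return text.length(); }
  @Override public char charAt(int index) { return text.charAt(index); }
  @Override public CharSequence subSequence(int start, int end) {
    return text.subSequence(start, end);
  }
  @Override public String toString() { return text; }
}

final class StringCoercionSketch {
  /**
   * Converts a copied value to the string representation the target expects.
   * targetUsesJavaString is an assumed replacement for checking the
   * "avro.java.string" property on the field schema.
   */
  static Object coerce(Object value, boolean targetUsesJavaString) {
    if (!(value instanceof CharSequence)) {
      return value; // null and non-string values pass through untouched
    }
    if (targetUsesJavaString) {
      return value.toString();
    }
    // Target expects the Utf8-style representation.
    return (value instanceof String) ? new Utf8Like((String) value) : value;
  }

  /** Imitates a generated put(): casts its argument without converting. */
  static String putName(Object value) {
    return (String) value;
  }
}
```

Calling {{putName(new Utf8Like("x"))}} directly throws a {{ClassCastException}}, while {{putName(coerce(new Utf8Like("x"), true))}} succeeds, which is the behaviour the change above is meant to achieve for records.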

I will submit a patch based on this idea. 

Further thoughts:
# [~ryonday], why did the unit test you provided 
[here|https://github.com/ryonday/avroDecodingHelp/blob/master/1.8.0/src/test/java/com/ryonday/avro/test/v180/AvroDeepCopyTest.java] 
use incompatible schemas for specific data and generic data? How did you generate the avsc files?
# [~rdblue], this problem reminds me of logical types. A logical type field 
can have two representations: one uses the raw type and the other uses a 
higher-level Java type. For instance, a "decimal" value can be held as either a 
ByteBuffer or a BigDecimal. Do we need to support copying between them?
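
To illustrate the two representations in the second point, here is a small sketch of converting a decimal between {{BigDecimal}} and {{ByteBuffer}}. It assumes the encoding described in the Avro specification for the "decimal" logical type (big-endian two's-complement bytes of the unscaled value, with the scale taken from the schema). The class and method names are hypothetical, and this is not a claim about how deepCopy should handle it.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

final class DecimalRepresentationSketch {
  /** Encodes the unscaled value as big-endian two's-complement bytes. */
  static ByteBuffer toBytes(BigDecimal value, int scale) {
    if (value.scale() != scale) {
      throw new IllegalArgumentException(
          "Value scale " + value.scale() + " does not match schema scale " + scale);
    }
    return ByteBuffer.wrap(value.unscaledValue().toByteArray());
  }

  /** Decodes the bytes back into a BigDecimal with the given scale. */
  static BigDecimal fromBytes(ByteBuffer buffer, int scale) {
    byte[] bytes = new byte[buffer.remaining()];
    buffer.duplicate().get(bytes); // duplicate() leaves the caller's position untouched
    return new BigDecimal(new BigInteger(bytes), scale);
  }
}
```

A copy between a record holding the raw form and one holding the higher-level form would need a step like this for every such field, which is the same shape of problem as the {{Utf8}} versus {{java.lang.String}} case.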



> SpecificData.deepCopy() cannot be used if schema compiler generated Java 
> objects with Strings instead of UTF8
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: AVRO-1811
>                 URL: https://issues.apache.org/jira/browse/AVRO-1811
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.8.0
>            Reporter: Ryon Day
>            Priority: Critical
>
> {panel:title=Description|titleBGColor=#3FA|bgColor=#DDD}
> When the Avro compiler creates Java objects, you have the option to have them 
> generate fields of type {{string}} with the Java standard {{String}} type, 
> for wide interoperability with existing Java applications and APIs.
> By default, however, the compiler outputs these fields in the Avro-specific 
> {{Utf8}} type, requiring frequent usage of the {{toString()}} method in order 
> for default domain objects to be used with the majority of Java libraries.
> There are two ways to get around this. The first is to annotate every 
> {{string}} field in a schema like so:
> {code}
>     {
>       "name": "some_string",
>       "doc": "a field that is guaranteed to compile to java.lang.String",
>       "type": [
>         "null",
>         {
>           "type": "string",
>           "avro.java.string": "String"
>         }
>       ]
>     },
> {code}
> Unfortunately, long schemas containing many string fields can be dominated by 
> this annotation by volume; teams using heterogeneous clients may 
> want to avoid Java-specific annotations in their schema files, or may not 
> think to use them unless there exist Java exploiters of the schema at the time 
> the schema is proposed and written.
> The other solution to the problem is to compile the schema into Java objects  
> using the {{SpecificCompiler}}'s string type selection. This option actually 
> alters the schema carried by the object's {{SCHEMA$}} field to have the above 
> annotation in it, ensuring that when used by the Java API, the String type 
> will be used. 
> Unfortunately, this method is not interoperable with GenericRecords created 
> by libraries that use the _original_ schema.
> {panel}
> {panel:title=Steps To Reproduce|titleBGColor=#8DB|bgColor=#DDD}
> # Create a schema with several {{string}} fields.
> # Parse the schema using the standard Avro schema parser
> # Create Java domain objects for that schema ensuring usage of the 
> {{java.lang.String}} string type.
> # Create a message of some sort that ends up as a {{GenericRecord}} of the 
> original schema
> # Attempt to use {{SpecificData.deepCopy()}} to make a {{SpecificRecord}} out 
> of the {{GenericRecord}} 
> There is a unit test that demonstrates this 
> [here|https://github.com/ryonday/avroDecodingHelp/blob/master/1.8.0/src/test/java/com/ryonday/avro/test/v180/AvroDeepCopyTest.java]
> {panel}
> {panel:title=Expected Results|titleBGColor=#AD3|bgColor=#DDD}
> As the schemas are literally identical aside from string type, the conversion 
> should work (and does work for schemas that are exactly identical).
> {panel}
> {panel:title=Actual Results|titleBGColor=#D55|bgColor=#DDD}
> {{ClassCastException}} with the message {{org.apache.avro.util.Utf8 cannot be 
> cast to java.lang.String}}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
