[ 
https://issues.apache.org/jira/browse/HUDI-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4959:
----------------------------------
    Description: 
Originally reported in:

[https://github.com/apache/hudi/issues/6621]

 

Kryo (used in SerializationUtils) by default allows objects to be serialized 
without their classes being registered with Kryo beforehand: in that case Kryo 
encodes the first occurrence of an object of a given class with the full class 
name, while subsequent occurrences use a class id assigned to it on the fly.

This poses issues for durable serialization (when we persist such a serialized 
layout): in this case we're trying to deserialize a file that doesn't have the 
class name encoded, and since the user is running a different Spark job to 
read it, there's no association preserved in memory either.

*NOTE: We should be using custom serialization sequences for every object we 
serialize for durable persistence, and avoid using frameworks like Kryo for 
that.*
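
To illustrate what a custom serialization sequence could look like, here is a 
minimal sketch, not Hudi's actual code: {{OrderingValueCodec}}, its method 
names and the version byte are all hypothetical, and it assumes the value can 
be represented as a string. The point is only that the on-disk layout is 
spelled out by hand and versioned, so it does not depend on the private fields 
of any library class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical codec: an explicit, versioned byte layout for a string value.
final class OrderingValueCodec {
    // Assumed format version; bump it whenever the layout below changes.
    private static final byte FORMAT_VERSION = 1;

    static byte[] encode(String orderingValue) throws IOException {
        byte[] utf8 = orderingValue.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(FORMAT_VERSION); // 1 byte: layout version
            out.writeInt(utf8.length);     // 4 bytes: payload size
            out.write(utf8);               // payload: UTF-8 encoded value
        }
        return buffer.toByteArray();
    }

    static String decode(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            byte version = in.readByte();
            if (version != FORMAT_VERSION) {
                // Fail loudly instead of misreading bytes written in another layout.
                throw new IOException("Unsupported format version: " + version);
            }
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decode(encode("2022-09-01")));
    }
}
```

A reader built against any library version decodes the same bytes the same 
way, and an unknown version is rejected explicitly rather than read as garbage.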

 
----
*EDIT*

I'm taking back my hypothesis that the issue is in the class encoding: after 
writing a small test to validate it, I confirmed that Kryo actually writes out 
the full class name for all classes registered implicitly (as it should).

It seems that the problem is indeed a misalignment of the Avro versions, as 
reported by [@KnightChess|https://github.com/KnightChess]: on a quick check I 
see that between Avro 1.8.2 and 1.10.2, {{Utf8}} had one more field added:
{code:java}
  // 1.8.2 
  private byte[] bytes = EMPTY;
  private int length;
  private String string;

  // 1.10.2
  private byte[] bytes;
  private int hash;
  private int length;
  private String string;
{code}
 
Given that we rely on Kryo to generate the serializer for {{orderingVal}}, 
which could be a {{Utf8}} (in which case it is based on {{FieldSerializer}}), 
this would explain why it couldn't be deserialized back: the writer and the 
reader end up with different serializers.
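
The failure mode can be sketched without Kryo or Avro at all. The class below 
is a hypothetical stand-in, not the real {{FieldSerializer}}: it assumes a 
serializer that writes field values back to back, in declaration order, with 
no field names or tags in the stream (as far as I know that is how 
{{FieldSerializer}} behaves by default). The writer uses the 1.8.2 field list 
and the reader the 1.10.2 one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a positional, schema-less field serializer.
final class FieldLayoutMismatch {

    // Writer side: mimics the 1.8.2 field list (bytes, length, string).
    static byte[] writeOldLayout(String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(bytes.length); // size prefix for the byte[] field
            out.write(bytes);           // field: bytes
            out.writeInt(bytes.length); // field: length
            out.writeUTF(value);        // field: string
        }
        return buffer.toByteArray();
    }

    // Reader with the same field list as the writer: stays aligned.
    static String readOldLayout(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);       // field: bytes
            int length = in.readInt(); // field: length
            return in.readUTF();       // field: string
        }
    }

    // Reader with the 1.10.2 field list (bytes, hash, length, string).
    static String readNewLayout(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);       // field: bytes
            int hash = in.readInt();   // field: hash, really the writer's length
            int length = in.readInt(); // field: length, really the start of string
            return in.readUTF();       // field: string, stream is misaligned by now
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = writeOldLayout("key");
        System.out.println("same layout: " + readOldLayout(payload));
        try {
            readNewLayout(payload);
        } catch (IOException e) {
            System.out.println("new layout failed: " + e);
        }
    }
}
```

Nothing in the payload tells the reader that {{hash}} is absent, so every 
field after {{bytes}} is read from the wrong offset. Depending on the data 
this shows up as an exception or as silently wrong values.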

  was:
Originally reported in:

https://github.com/apache/hudi/issues/6621

 

Kryo (used in SerializationUtils) by default allows class objects to be 
serialized w/o prior registration w/ Kryo: in that case Kryo will encode the 
first occurrence of the object of a particular class with full class-name, but 
subsequent occurrences will be using class-id associated with it (on the fly).

This poses issues for durable serialization (when we persist such serialized 
layout) in this case we're trying to deserialize file that doesn't have the 
class-name encoded and since user is running a different Spark job to read 
there's no association preserved in-memory either.

*NOTE: We should be using custom serialization sequences for every object we 
serialize for durable persistence, and avoid using frameworks like Kryo for 
that.*


> Serializing objects using Kryo fails to deserialize data back w/o prior 
> registration
> ------------------------------------------------------------------------------------
>
>                 Key: HUDI-4959
>                 URL: https://issues.apache.org/jira/browse/HUDI-4959
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>    Affects Versions: 0.12.0
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.13.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
