[ 
https://issues.apache.org/jira/browse/AVRO-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044266#comment-14044266
 ] 

Sachin Goyal commented on AVRO-695:
-----------------------------------


h3.Circular References
------------------------------------

*Serialization*
Extra API required (not optional): 
{code}ReflectData.setCircularRefIdPrefix("some-field-name"){code}

If the above is set, the following happens:
  # During serialization, each record contains the extra field specified above. 
The value of this field is a monotonically increasing number that uniquely 
identifies each record within one particular serialization.
  # While writing the schema, each RECORD schema is converted into a UNION 
schema so that the value can be either a record or a string. During object 
serialization, a record that has already been seen is not written as a record 
again. Instead, it is written as a string in the format: "some-field-name" + 
"ID generated in #1 above". With this structure, reading applications have 
enough information to restore the circular reference if they want to. The 
structure is also usable by languages that do not support circular references, 
because they will read the circular reference as a normal string.
(AllowNull also works with this.)
  # The above field name is included in each record as a property. This lets 
readers discover the field name, so clients do not have to specify it just to 
populate the circular references. Basically, it makes the schema 
self-sufficient.
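
To illustrate steps #1 and #2 above, here is a minimal sketch in plain Java. It does not use Avro or the patch's classes: the class CircularRefWriterSketch, the Node type, and the use of a java.util.Map to stand in for a written record are all hypothetical, and starting the IDs at 1 is an assumption.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Avro or of the patch.
class CircularRefWriterSketch {
    // Hypothetical record type with a self-referential field.
    static class Node {
        final String name;
        Node next;
        Node(String name) { this.name = name; }
    }

    // Plays the role of the value passed to setCircularRefIdPrefix.
    static final String PREFIX = "some-field-name";

    // Identity-based lookup: two distinct but equal objects get distinct IDs.
    private final Map<Object, Long> seen = new IdentityHashMap<>();
    private long nextId = 1; // assumption: IDs start at 1

    // Returns a Map (the "record" branch of the union) the first time a node
    // is seen, and a String (the back-reference branch) on later visits.
    Object write(Node node) {
        if (node == null) {
            return null;
        }
        Long id = seen.get(node);
        if (id != null) {
            return PREFIX + id; // already written: emit only the reference
        }
        id = nextId++;
        seen.put(node, id);
        Map<String, Object> record = new LinkedHashMap<>();
        record.put(PREFIX, id); // the extra ID field from step #1
        record.put("name", node.name);
        record.put("next", write(node.next));
        return record;
    }
}
```

With two nodes a and b that point at each other, writing a produces a record for a, a nested record for b, and the string "some-field-name1" where b points back to a.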

*Deserialization*
Extra API required (optional):
{code}GenericDatumReader.setResolveCircularRefs(boolean){code}

Based on #3 above, GenericDatumReader becomes circular-reference-aware.
But since all GenericDatumReaders share a common GenericData instance, each 
reader gets its own flag, "resolveCircularRefs", to control whether it 
resolves circular references.
If this flag is set and the serialized schema has a non-null value for the 
circular-reference field, GenericDatumReader does the following:
  # If any record has the circular-reference field, store its value and the 
corresponding record in a map.
  # Look for unions that can be serialized as either a record or a string. On 
finding such a record serialized as a string, replace the string with the 
record retrieved from the map created in #1.
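
The two reader steps can be sketched in plain Java as follows. This is not the patch's GenericDatumReader code: the class CircularRefReaderSketch is hypothetical, records are modelled as java.util.Map instances, and because the sketch has no schema it treats any string that matches a known reference as a back-reference, whereas the real reader would only look at record-or-string unions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Avro or of the patch.
class CircularRefReaderSketch {
    // Assumed to be the field name carried as a schema property.
    static final String PREFIX = "some-field-name";

    // Step #1: reference string -> record that owns that ID.
    private final Map<String, Map<String, Object>> byRef = new HashMap<>();

    // Walks a decoded value and replaces back-reference strings in place.
    @SuppressWarnings("unchecked")
    Object resolve(Object value) {
        if (value instanceof String) {
            // Step #2: a string naming a known record becomes that record.
            Map<String, Object> target = byRef.get((String) value);
            return target != null ? target : value;
        }
        if (value instanceof Map) {
            Map<String, Object> record = (Map<String, Object>) value;
            Object id = record.get(PREFIX);
            if (id != null) {
                // Register before descending so nested references can find it.
                byRef.put(PREFIX + id, record);
            }
            for (Map.Entry<String, Object> field : record.entrySet()) {
                if (!PREFIX.equals(field.getKey())) {
                    field.setValue(resolve(field.getValue()));
                }
            }
        }
        return value;
    }
}
```

Registering each record before reading its fields is what makes a single pass sufficient in this sketch: a back-reference can only appear after the record it names has started being read.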




h3.Non-string map-keys
----------------------------------------

*Serialization*
No extra API required.

Without this patch, Avro throws an exception for non-string map-keys.
This patch converts such maps into an array of records where each record has 
two fields: key and value. Example:
Map<ObjX, ObjY> is converted to [{"key":{ObjX}, "value":{ObjY}}]
To do this, the following is done:
  # In ReflectData.java, create a schema for the key as well as the value of 
the non-string hash-map.
  Encapsulate these two schemas in a record schema and create an array schema 
of such records.
  Set the property NS_MAP_FLAG to "1" and store the actual class of the map as 
CLASS_PROP.

  # While writing out a non-string map field, if NS_MAP_FLAG is set, convert 
the map to an array of records using map.entrySet().
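
As a rough sketch of step #2, the conversion from a map to an array of key/value records could look like this in plain Java. The class and method names are hypothetical, and each record is modelled as a java.util.Map rather than an Avro record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Avro or of the patch.
class NonStringMapWriterSketch {
    // Turns Map<K, V> into a list of {"key": k, "value": v} records,
    // in the iteration order of map.entrySet().
    static <K, V> List<Map<String, Object>> toKeyValueRecords(Map<K, V> map) {
        List<Map<String, Object>> records = new ArrayList<>(map.size());
        for (Map.Entry<K, V> entry : map.entrySet()) {
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("key", entry.getKey());
            record.put("value", entry.getValue());
            records.add(record);
        }
        return records;
    }
}
```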



*Deserialization*
No extra API required.

Deserialization for non-string map-keys is simple, since the data and the 
schema match exactly, so it happens automatically.
To create an actual map (for example, when using ReflectDatumReader with an 
actual-class type parameter), the map is instantiated using CLASS_PROP if the 
property NS_MAP_FLAG is set to "1".
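
A minimal sketch of rebuilding the map from the decoded key/value records, assuming CLASS_PROP holds a fully qualified class name with a public no-argument constructor. The class and method names are hypothetical, and the reflection shown here is only one way the instantiation could be done.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Avro or of the patch.
class NonStringMapReaderSketch {
    // Instantiates the map class by name and fills it from the records.
    @SuppressWarnings("unchecked")
    static Map<Object, Object> toMap(List<Map<String, Object>> records,
                                     String mapClassName) {
        Map<Object, Object> map;
        try {
            // Assumption: the stored class has a public no-arg constructor.
            map = (Map<Object, Object>) Class.forName(mapClassName)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Cannot instantiate map class " + mapClassName, e);
        }
        for (Map<String, Object> record : records) {
            map.put(record.get("key"), record.get("value"));
        }
        return map;
    }
}
```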



h3.Testcases included
------------------------------------
The unit tests cover the following:
# Circular references at multiple levels of hierarchy
# Circular references within Collections and Maps. 
# Circular and non-circular deserialization of circularly serialized objects.
# Non-string map-keys having circular references.
# Non-string map-keys with nested maps.


> Cycle Reference Support
> -----------------------
>
>                 Key: AVRO-695
>                 URL: https://issues.apache.org/jira/browse/AVRO-695
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>    Affects Versions: 1.7.6
>            Reporter: Moustapha Cherri
>         Attachments: avro-1.4.1-cycle.patch.gz, avro-1.4.1-cycle.patch.gz, 
> avro_circular_references.zip, avro_circular_refs_2014_06_14.zip, 
> circular_refs_and_nonstring_map_keys_2014_06_25.zip
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This is a proposed implementation to add cycle reference support to Avro. It 
> basically introduces a new type named Cycle. A Cycle contains a string 
> representing the path to the other reference.
> For example, suppose we have an object of type Message that has a member 
> named previous, also of type Message, and we have this hierarchy:
> message
>   previous : message2
> message2
>   previous : message2
> When serializing, the cycle path for "message2.previous" will be "previous".
> The implementation depends on ANTLR to evaluate those cycles at read time to 
> resolve them. I used ANTLR 3.2. This dependency is not mandatory; I just used 
> ANTLR to speed things up. I kept the ANTLR-generated code in this 
> implementation, though it should really be generated during the build. I only 
> updated the Java code.
> I did not do full unit testing, but the "avrotest.Main" class can be used as 
> a preliminary test.
> Please do not hesitate to contact me for further clarification if this seems 
> interesting.
> Best regards,
> Moustapha Cherri



--
This message was sent by Atlassian JIRA
(v6.2#6252)
