[
https://issues.apache.org/jira/browse/AVRO-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068023#comment-14068023
]
Sachin Goyal commented on AVRO-695:
-----------------------------------
{quote}
The writer would add an entry to an IdentityHashMap<Object,Integer> for every
sub-record it writes. Whenever it encounters a previously-written record, it
writes a ref instead. Similarly, the reader would add each record it reads to
an array, and when a ref is read, return the corresponding element of the array.
{quote}
The current fix does use an IdentityHashMap to do this. Reference code in patch:
# GenericDatumWriter.java, line 40 and
# GenericDatumReader.java, line 46
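To make the quoted idea concrete, here is a minimal sketch of the bookkeeping on both sides. It is not the code from the patch: the class and method names ({{RefTrackingSketch}}, {{WriterRefs}}, {{ReaderRefs}}) are hypothetical, and it assumes refs are plain integer indexes assigned in write order.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Avro or of the patch.
final class RefTrackingSketch {

  /** Writer side: remembers each record instance by identity, not equals(). */
  static final class WriterRefs {
    private final Map<Object, Integer> seen = new IdentityHashMap<>();

    /**
     * Returns the index of a previously written record, or -1 after
     * registering a record that is being written for the first time.
     */
    int refOrRegister(Object record) {
      Integer existing = seen.get(record);
      if (existing != null) {
        return existing;
      }
      // seen.size() is evaluated before the put, so the first record gets 0.
      seen.put(record, seen.size());
      return -1;
    }
  }

  /** Reader side: records are appended in the order they are read. */
  static final class ReaderRefs {
    private final List<Object> read = new ArrayList<>();

    /** Registers a newly read record and returns its index. */
    int register(Object record) {
      read.add(record);
      return read.size() - 1;
    }

    /** Returns the record that a ref points to. */
    Object resolve(int ref) {
      return read.get(ref);
    }
  }
}
```

For cycles to resolve, both sides would need to register a record before descending into its fields, so that a link back to the record currently being processed already has an index.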
Please correct me if I am wrong, but it appears the schema generated for a
circular list should look somewhat like this:
{code:javascript}
{
"type" : "record",
"name" : "CircularList",
"namespace" : "org.apache.avro.generic",
"fields" : [ {
"name" : "__crefId",
"type" : "string"
}, {
"name" : "nodeData",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "next",
"type" : [ "null", "CircularList", "string" ],
"default" : null
} ],
"circularRefIdPrefix" : "__crefId"
}
{code}
(This was generated using the current patch.)
\\
\\
Circular references could occur just about anywhere.
For example, in a family tree involving grandparents, uncles, aunts, cousins,
children, grandchildren, etc., circular references could be encountered on many
branches going out from a single node.
Since we do not know which outgoing link will turn out to point to an
already-traversed node, the *__crefId* field needs to be written in advance for
each and every record. Hence the need for a separate field in *each* record.
{code:javascript}
"fields" : [ {
"name" : "__crefId",
"type" : "string"
}, ....
{code}
Now, when we do encounter an already-traversed node, the node must be written
as an ID. Hence the type of every field that refers to a record must be a union
that includes string:
{code:javascript}
"type" : [ "null", "CircularList", "string" ]
{code}
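As a rough illustration of the two points above, the sketch below flattens a circular list into an acyclic structure: every record gets its *__crefId* up front, and a link to an already-traversed node is written as that node's ID string. This is not the patch's code. The names ({{CrefIdSketch}}, {{Node}}, {{flatten}}) are hypothetical, plain maps stand in for Avro records, and the ID format (a counter rendered as a string) is an assumption.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; class and method names are not from the patch.
final class CrefIdSketch {

  /** A list node that may take part in a cycle. */
  static final class Node {
    String nodeData;
    Node next;

    Node(String nodeData) {
      this.nodeData = nodeData;
    }
  }

  /** Converts a possibly circular list into an acyclic, tree-shaped structure. */
  static Map<String, Object> flatten(Node head) {
    return flatten(head, new IdentityHashMap<>());
  }

  private static Map<String, Object> flatten(Node node, Map<Node, String> ids) {
    // The ID is assigned before descending, because we cannot know in
    // advance which later link will point back to this node.
    // Assumption: IDs are a simple counter rendered as a string.
    String id = String.valueOf(ids.size());
    ids.put(node, id);

    Map<String, Object> record = new LinkedHashMap<>();
    record.put("__crefId", id);
    record.put("nodeData", node.nodeData);

    Node next = node.next;
    if (next == null) {
      record.put("next", null);               // union branch "null"
    } else if (ids.containsKey(next)) {
      record.put("next", ids.get(next));      // union branch "string": ID only
    } else {
      record.put("next", flatten(next, ids)); // union branch "CircularList"
    }
    return record;
  }
}
```

For a two-node cycle a -> b -> a, the outer record for a carries its own ID, the nested record for b carries b's ID, and b's {{next}} field holds a's ID as a string instead of a nested record.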
I would be happy to consider other options if the above seems incorrect.
If it seems correct, +I will submit a patch without non-string map-keys+.
\\
\\
\\
[~martinkl], Avro currently supports circular references in schemas,
so supporting circular references in data should be a natural extension of
that.
Also, circular references are very common in ORMs (like Hibernate/JPA) and in
Java-based programs in general.
http://stackoverflow.com/questions/11007247/are-circular-references-in-jpa-an-antipattern
And parsers like Gson and Jackson support this feature too.
The serialized data from the above patch should work with all language
implementations and also with Hive/Pig (because we are breaking the circular
reference by changing it to an ID).
Please share if you think otherwise.
> Cycle Reference Support
> -----------------------
>
> Key: AVRO-695
> URL: https://issues.apache.org/jira/browse/AVRO-695
> Project: Avro
> Issue Type: New Feature
> Components: spec
> Affects Versions: 1.7.6
> Reporter: Moustapha Cherri
> Attachments: avro-1.4.1-cycle.patch.gz, avro-1.4.1-cycle.patch.gz,
> avro_circular_references.zip, avro_circular_refs_2014_06_14.zip,
> circular_refs_and_nonstring_map_keys_2014_06_25.zip
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> This is a proposed implementation to add cycle reference support to Avro. It
> basically introduces a new type named Cycle. A Cycle contains a string
> representing the path to the other reference.
> For example, suppose we have an object of type Message that has a member named
> previous, also of type Message, and we have this hierarchy:
> message
> previous : message2
> message2
> previous : message2
> When serializing, the cycle path for "message2.previous" will be "previous".
> The implementation depends on ANTLR to evaluate those cycles at read time to
> resolve them. I used ANTLR 3.2. This dependency is not mandatory; I just used
> ANTLR to speed things up. I kept the ANTLR-generated code in this
> implementation, though this should not be the case, as it should be generated
> during the build. I only updated the Java code.
> I did not do full unit testing, but you can find the "avrotest.Main" class,
> which can be used as a preliminary test.
> Please do not hesitate to contact me for further clarification if this seems
> interesting.
> Best regards,
> Moustapha Cherri