[ 
https://issues.apache.org/jira/browse/AVRO-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Pivarski updated AVRO-1422:
-------------------------------

    Description: 
A schema defined like this:

{code:title=recursiveSchema.avsc|borderStyle=solid}
{"type": "record",
 "name": "RecursiveRecord",
 "fields": [
   {"name": "child", "type": "RecursiveRecord"}
 ]}
{code}

results in infinite recursion (a stack overflow) when ingesting JSON such as 
{{{"child": null}}} or {{{"child": {"null": null}}}}.  For instance, I can 
compile the schema, load it into a Scala REPL, and then trigger the error by 
reading the JSON, like this:

{code:title=command-line-1|borderStyle=solid}
java -jar avro-tools-1.7.5.jar compile schema recursiveSchema.avsc .
javac RecursiveRecord.java -cp avro-tools-1.7.5.jar
scala -cp avro-tools-1.7.5.jar:.
{code}
{code:title=scala-repl-specific-1|borderStyle=solid}
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificDatumReader;

var output: RecursiveRecord = new RecursiveRecord();
val schema: Schema = output.getSchema();
val reader: SpecificDatumReader[RecursiveRecord] = new SpecificDatumReader[RecursiveRecord](schema);
output = reader.read(output, DecoderFactory.get().jsonDecoder(schema, """{"child": null}"""));
output = reader.read(output, DecoderFactory.get().jsonDecoder(schema, """{"child": {"null": null}}"""));
{code}

The same is true if I attempt to load it into a generic object:

{code:title=scala-repl-generic-1|borderStyle=solid}
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;

val parser = new Schema.Parser();
val schema: Schema = parser.parse("""{"type": "record", "name": "RecursiveRecord", "fields": [{"name": "child", "type": "RecursiveRecord"}]}""");
val reader: GenericDatumReader[java.lang.Object] = new GenericDatumReader[java.lang.Object](schema);
val output = reader.read(null, DecoderFactory.get().jsonDecoder(schema, """{"child": null}"""));
val output = reader.read(null, DecoderFactory.get().jsonDecoder(schema, """{"child": {"null": null}}"""));
{code}

In all cases, it is the {{reader.read}} calls that cause the stack overflows 
(all four calls shown above).  The stack trace is truncated, but what is shown 
repeats these two lines until cut off by the JVM:
{code:title=stack-trace|borderStyle=solid}
        at org.apache.avro.io.parsing.Symbol$Sequence.flattenedSize(Symbol.java:324)
        at org.apache.avro.io.parsing.Symbol.flattenedSize(Symbol.java:217)
{code}

The same is not true if we (correctly?) declare the child as a union of null 
and a recursive record.  For instance,

{code:title=recursiveSchema2.avsc|borderStyle=solid}
{"type": "record",
 "name": "RecursiveRecord2",
 "fields": [
   {"name": "child", "type": ["RecursiveRecord2", "null"]}
 ]}
{code}
{code:title=command-line-2|borderStyle=solid}
java -jar avro-tools-1.7.5.jar compile schema recursiveSchema2.avsc .
javac RecursiveRecord2.java -cp avro-tools-1.7.5.jar
scala -cp avro-tools-1.7.5.jar:.
{code}
{code:title=scala-repl-specific-2|borderStyle=solid}
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificDatumReader;

var output: RecursiveRecord2 = new RecursiveRecord2();
val schema: Schema = output.getSchema();
val reader: SpecificDatumReader[RecursiveRecord2] = new SpecificDatumReader[RecursiveRecord2](schema);
output = reader.read(output, DecoderFactory.get().jsonDecoder(schema, """{"child": null}"""));
output = reader.read(output, DecoderFactory.get().jsonDecoder(schema, """{"child": {"null": null}}"""));
{code}
{code:title=scala-repl-generic-2|borderStyle=solid}
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;

val parser = new Schema.Parser();
val schema: Schema = parser.parse("""{"type": "record", "name": "RecursiveRecord2", "fields": [{"name": "child", "type": ["RecursiveRecord2", "null"]}]}""");
val reader: GenericDatumReader[java.lang.Object] = new GenericDatumReader[java.lang.Object](schema);
val output = reader.read(null, DecoderFactory.get().jsonDecoder(schema, """{"child": null}"""));
val output = reader.read(null, DecoderFactory.get().jsonDecoder(schema, """{"child": {"null": null}}"""));
{code}

For both specific and generic, {{RecursiveRecord2}} works properly: it produces 
an object with recursive type and {{child == null}}.

My understanding of the Avro specification is that only {{RecursiveRecord2}} 
should be allowed to have a null {{child}}, so the JSON I supplied would not 
have been valid input for {{RecursiveRecord}}.  (If so, then it wouldn't even 
be possible to give {{RecursiveRecord}} any valid finite input.)  However, the 
parser should report something more informative than a stack overflow, e.g. 
that {{{"child": null}}} is not legal unless the {{child}} field is declared 
as a union that includes {{null}}.

The reason one might want recursively defined types is to build trees.  My 
example has only one child for simplicity (i.e. it is a linked list), but the 
error would apply equally to binary trees.  For instance, here's a three-node 
list (a little cumbersome in JSON):

{code:title=motivating-example|borderStyle=solid}
{"child": {"RecursiveRecord2": {"child": {"RecursiveRecord2": {"child": null}}}}}
{code}
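
A binary-tree analogue (illustrative only, not part of the reproduction above) 
would follow the same union pattern:

{code:title=binaryTree.avsc|borderStyle=solid}
{"type": "record",
 "name": "TreeNode",
 "fields": [
   {"name": "left", "type": ["TreeNode", "null"]},
   {"name": "right", "type": ["TreeNode", "null"]}
 ]}
{code}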

I haven't tested this in Avro deserialization (which would be a more reasonable 
use-case), but I don't know of a way to generate the Avro-encoded data without 
first getting it from human-typable JSON.  (I'm not constructing the Avro byte 
stream by hand.)
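
That said, I believe the binary data could be produced without going through 
JSON at all, by building records with the generic API and a binary encoder.  A 
sketch (untested here, using only public Avro classes):

{code:title=binary-encoding-sketch|borderStyle=solid}
import org.apache.avro.Schema;
import org.apache.avro.generic.{GenericData, GenericDatumWriter};
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

// Build a two-node RecursiveRecord2 list in memory, then binary-encode it.
val parser = new Schema.Parser();
val schema: Schema = parser.parse("""{"type": "record", "name": "RecursiveRecord2", "fields": [{"name": "child", "type": ["RecursiveRecord2", "null"]}]}""");

val leaf = new GenericData.Record(schema);
leaf.put("child", null);
val root = new GenericData.Record(schema);
root.put("child", leaf);

val out = new ByteArrayOutputStream();
val encoder = EncoderFactory.get().binaryEncoder(out, null);
val writer = new GenericDatumWriter[GenericData.Record](schema);
writer.write(root, encoder);
encoder.flush();
val bytes = out.toByteArray();  // Avro binary, no JSON involved
{code}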



> JSON-deserialization of recursively defined record causes stack overflow
> ------------------------------------------------------------------------
>
>                 Key: AVRO-1422
>                 URL: https://issues.apache.org/jira/browse/AVRO-1422
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.5
>         Environment: Linux (but it doesn't matter because it's Java).
>            Reporter: Jim Pivarski
>              Labels: infinite-loop, recursive, stack-overflow
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
