[ 
https://issues.apache.org/jira/browse/AVRO-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916018#comment-13916018
 ] 

Jim Pivarski commented on AVRO-1467:
------------------------------------

Regarding: "Record names are not checked during Java schema resolution. 
Changing this might break applications that currently work, so a fix should 
wait until Avro 1.8.0."

Some applications might inadvertently rely on this behavior, so I understand 
why you'd want to wait for a backward-incompatible release to make this change. 
But perhaps there are also applications that use this behavior intentionally: 
for instance, you don't care what record names you're going to receive, but you 
need the records to have a particular subset of fields.

I'm developing a decision tree processor, and I want to make sure that an input 
record type has a particular form of predicate and branches to follow if the 
predicate passes or fails, like this:
{code:json}
{"type": "record", "name": "Tree", "fields": [
    {"name": "field", "type": {"type": "enum", "name": "Fields", "symbols": ["one", "two", "three"]}},
    {"name": "comparison", "type": {"type": "enum", "name": "Comparisons", "symbols": ["lessThan", "equalTo"]}},
    {"name": "value", "type": "double"},
    {"name": "pass", "type": ["Tree", {"type": "enum", "name": "Score", "symbols": ["class1", "class2", "class3", "class4"]}]},
    {"name": "fail", "type": ["Tree", "Score"]}
]}
{code}
(I could relax some of the enums to be arbitrary strings in cases where I 
don't know all the symbols in advance; this is just for illustration.)  The 
input tree models may have metadata in the form of unspecified fields, such as 
{"name": "numberOfTrainingSamples", "type": "int"}, and I'm using Avro's schema 
resolution to ignore them.  They may appear in the writer's schema and are 
therefore embedded in the model for future reference, but they don't appear in 
the reader's schema (the decision tree processor) because they're irrelevant to 
scoring.
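
Concretely, a writer's schema might look like the Tree schema above with the 
extra metadata field added (a sketch; my reader's schema simply omits 
numberOfTrainingSamples, and schema resolution drops it during reading):
{code:json}
{"type": "record", "name": "Tree", "fields": [
    {"name": "numberOfTrainingSamples", "type": "int"},
    {"name": "field", "type": {"type": "enum", "name": "Fields", "symbols": ["one", "two", "three"]}},
    {"name": "comparison", "type": {"type": "enum", "name": "Comparisons", "symbols": ["lessThan", "equalTo"]}},
    {"name": "value", "type": "double"},
    {"name": "pass", "type": ["Tree", {"type": "enum", "name": "Score", "symbols": ["class1", "class2", "class3", "class4"]}]},
    {"name": "fail", "type": ["Tree", "Score"]}
]}
{code}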

When Avro 1.8.x enforces record names, it could be useful to be able to say 
something like "aliases": ["*"] to accept input records with any name but the 
correct structure.  That way, I'm not putting a restriction on the namespaces 
of the input records; they can be "com.mycompany.Tree" or 
"com.yourcompany.Tree" or whatever.

Should I open a ticket to request "aliases": ["*"]?
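
What I have in mind would look something like this (hypothetical syntax; "*" 
is not a valid alias today, and this toy record stands in for my Tree schema):
{code:json}
{"type": "record", "name": "Tree", "aliases": ["*"],
 "fields": [{"name": "value", "type": "double"}]}
{code}
Any writer record with a compatible field structure would then resolve against 
this reader schema, regardless of its name or namespace.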


> Schema resolution does not check record names
> ---------------------------------------------
>
>                 Key: AVRO-1467
>                 URL: https://issues.apache.org/jira/browse/AVRO-1467
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.6
>            Reporter: Jim Pivarski
>             Fix For: 1.8.0
>
>
> According to http://avro.apache.org/docs/1.7.6/spec.html#Schema+Resolution , 
> writer and reader schemas should be considered compatible if they (1) have 
> the same name and (2) the reader requests a subset of the writer's fields 
> with compatible types.  In the Java version, I find that the structure of the 
> fields is checked but the name is _not_ checked.  (It's too permissive; it 
> acts like a structural type check rather than a structural and nominal one.)
> Here's a demonstration (in the Scala REPL to allow for experimentation; 
> launch with "scala -cp avro-tools-1.7.6.jar" to get all the classes).  The 
> following writes a small, valid Avro data file:
> {code:java}
> import org.apache.avro.file.DataFileReader
> import org.apache.avro.file.DataFileWriter
> import org.apache.avro.generic.GenericData
> import org.apache.avro.generic.GenericDatumReader
> import org.apache.avro.generic.GenericDatumWriter
> import org.apache.avro.generic.GenericRecord
> import org.apache.avro.io.DatumReader
> import org.apache.avro.io.DatumWriter
> import org.apache.avro.Schema
> val parser = new Schema.Parser
> // The name is different but the fields are the same.
> val writerSchema = parser.parse("""{"type": "record", "name": "Writer", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}""")
> val readerSchema = parser.parse("""{"type": "record", "name": "Reader", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}""")
> def makeRecord(one: Int, two: String): GenericRecord = {
>   val out = new GenericData.Record(writerSchema)
>   out.put("one", one)
>   out.put("two", two)
>   out
> }
> val datumWriter = new GenericDatumWriter[GenericRecord](writerSchema)
> val dataFileWriter = new DataFileWriter[GenericRecord](datumWriter)
> dataFileWriter.create(writerSchema, new java.io.File("/tmp/test.avro"))
> dataFileWriter.append(makeRecord(1, "one"))
> dataFileWriter.append(makeRecord(2, "two"))
> dataFileWriter.append(makeRecord(3, "three"))
> dataFileWriter.close()
> {code}
> Looking at the output with "hexdump -C /tmp/test.avro", we see that the 
> writer schema is embedded in the file, and the record's name is "Writer".  To 
> read it back:
> {code:java}
> val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
> val dataFileReader = new DataFileReader[GenericRecord](new java.io.File("/tmp/test.avro"), datumReader)
> while (dataFileReader.hasNext) {
>   val in = dataFileReader.next()
>   println(in, in.getSchema)
> }
> {code}
> The problem is that the above is successful, even though I'm requesting a 
> record with name "Reader".
> If I make structurally incompatible records, for instance by writing with 
> "Writer.two" being an integer and "Reader.two" being a string, it fails to 
> read with org.apache.avro.AvroTypeException (as it should).  If I try the 
> above test with an enum type or a fixed type, it _does_ require the writer 
> and reader names to match: record is the only named type for which the name 
> is ignored during schema resolution.
> We're supposed to use aliases to explicitly declare which structurally 
> compatible writer-reader combinations to accept.  Because of the above bug, 
> differently named records are accepted regardless of their aliases, but enums 
> and fixed types are not accepted, even if they have the right aliases.  This 
> may be a separate bug, or it may be related to the above.
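> For example, my reading of the spec is that declaring the writer's name as 
> an alias on the reader (a sketch using the schemas from my test above) 
> should make the differently named records resolve:
> {code:json}
> {"type": "record", "name": "Reader", "aliases": ["Writer"],
>  "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}
> {code}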
> To make sure that I'm correctly understanding the specification, I tried 
> exactly the same thing in the Python version:
> {code:python}
> import avro.schema
> from avro.datafile import DataFileReader, DataFileWriter
> from avro.io import DatumReader, DatumWriter
> writerSchema = avro.schema.parse('{"type": "record", "name": "Writer", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}')
> readerSchema = avro.schema.parse('{"type": "record", "name": "Reader", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}')
> writer = DataFileWriter(open("/tmp/test2.avro", "w"), DatumWriter(), writerSchema)
> writer.append({"one": 1, "two": "one"})
> writer.append({"one": 2, "two": "two"})
> writer.append({"one": 3, "two": "three"})
> writer.close()
> reader = DataFileReader(open("/tmp/test2.avro"), DatumReader(None, readerSchema))
> for datum in reader:
>     print datum
> for datum in reader:
>     print datum
> {code}
> The Python code fails on the first read with 
> avro.io.SchemaResolutionException, as it is supposed to.  (Interestingly, 
> Python ignores the aliases as well, which I don't think it's supposed to do.  
> Since the Java and Python versions both have the same behavior with regard to 
> aliases, I wonder if I'm understanding 
> http://avro.apache.org/docs/1.7.6/spec.html#Aliases correctly.)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
