[ 
https://issues.apache.org/jira/browse/AVRO-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated AVRO-1467:
-------------------------------
    Fix Version/s:     (was: 1.10.0)

> Schema resolution does not check record names
> ---------------------------------------------
>
>                 Key: AVRO-1467
>                 URL: https://issues.apache.org/jira/browse/AVRO-1467
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.6
>            Reporter: Jim Pivarski
>            Priority: Major
>
> According to http://avro.apache.org/docs/1.7.6/spec.html#Schema+Resolution , 
> writer and reader record schemas should be considered compatible if (1) they 
> have the same name and (2) the reader requests a subset of the writer's 
> fields with compatible types.  In the Java version, I find that the structure 
> of the fields is checked but the name is _not_ checked.  (It's too 
> permissive: it acts like a structural type check, rather than a structural 
> and nominal one.)
> Here's a demonstration (in the Scala REPL to allow for experimentation; 
> launch with "scala -cp avro-tools-1.7.6.jar" to get all the classes).  The 
> following writes a small, valid Avro data file:
> {code:java}
> import org.apache.avro.file.DataFileReader
> import org.apache.avro.file.DataFileWriter
> import org.apache.avro.generic.GenericData
> import org.apache.avro.generic.GenericDatumReader
> import org.apache.avro.generic.GenericDatumWriter
> import org.apache.avro.generic.GenericRecord
> import org.apache.avro.io.DatumReader
> import org.apache.avro.io.DatumWriter
> import org.apache.avro.Schema
> val parser = new Schema.Parser
> // The name is different but the fields are the same.
> val writerSchema = parser.parse("""{"type": "record", "name": "Writer", 
> "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": 
> "string"}]}""")
> val readerSchema = parser.parse("""{"type": "record", "name": "Reader", 
> "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": 
> "string"}]}""")
> def makeRecord(one: Int, two: String): GenericRecord = {
>   val out = new GenericData.Record(writerSchema)
>   out.put("one", one)
>   out.put("two", two)
>   out
> }
> val datumWriter = new GenericDatumWriter[GenericRecord](writerSchema)
> val dataFileWriter = new DataFileWriter[GenericRecord](datumWriter)
> dataFileWriter.create(writerSchema, new java.io.File("/tmp/test.avro"))
> dataFileWriter.append(makeRecord(1, "one"))
> dataFileWriter.append(makeRecord(2, "two"))
> dataFileWriter.append(makeRecord(3, "three"))
> dataFileWriter.close()
> {code}
> Looking at the output with "hexdump -C /tmp/test.avro", we see that the 
> writer schema is embedded in the file, and the record's name is "Writer".  To 
> read it back:
> {code:java}
> val datumReader = new GenericDatumReader[GenericRecord](writerSchema, 
> readerSchema)
> val dataFileReader = new DataFileReader[GenericRecord](new 
> java.io.File("/tmp/test.avro"), datumReader)
> while (dataFileReader.hasNext) {
>   val in = dataFileReader.next()
>   println(in, in.getSchema)
> }
> {code}
> The problem is that the read above succeeds, even though I'm requesting a 
> record with name "Reader".
> If I make structurally incompatible records, for instance by writing with 
> "Writer.two" being an integer and "Reader.two" being a string, it fails to 
> read with org.apache.avro.AvroTypeException (as it should).  If I try the 
> above test with an enum type or a fixed type, it _does_ require the writer 
> and reader names to match: record is the only named type for which the name 
> is ignored during schema resolution.
> We're supposed to use aliases to explicitly declare which structurally 
> compatible writer-reader combinations to accept.  Because of the above bug, 
> differently named records are accepted regardless of their aliases, but enums 
> and fixed types are not accepted, even if they have the right aliases.  This 
> may be a separate bug, or it may be related to the above.
> To make sure that I'm correctly understanding the specification, I tried 
> exactly the same thing in the Python version:
> {code:python}
> import avro.schema
> from avro.datafile import DataFileReader, DataFileWriter
> from avro.io import DatumReader, DatumWriter
> writerSchema = avro.schema.parse("""{"type": "record", "name": "Writer", 
> "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": 
> "string"}]}""")
> readerSchema = avro.schema.parse("""{"type": "record", "name": "Reader", 
> "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": 
> "string"}]}""")
> writer = DataFileWriter(open("/tmp/test2.avro", "w"), DatumWriter(), 
> writerSchema)
> writer.append({"one": 1, "two": "one"})
> writer.append({"one": 2, "two": "two"})
> writer.append({"one": 3, "two": "three"})
> writer.close()
> reader = DataFileReader(open("/tmp/test2.avro"), DatumReader(None, 
> readerSchema))
> for datum in reader:
>     print datum
> {code}
> The Python code fails in the first read with 
> avro.io.SchemaResolutionException, as it is supposed to.  (Interestingly, 
> Python ignores the aliases as well, which I think it's not supposed to do.  
> Since the Java and Python versions both have the same behavior with regard to 
> aliases, I wonder if I'm understanding 
> http://avro.apache.org/docs/1.7.6/spec.html#Aliases correctly.)
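
For reference, the name check that the report expects can be sketched in a few lines of plain Python. This is only an illustration of the rule as the reporter reads it, not code from either Avro implementation: record_names_match is a hypothetical helper, it works on schemas parsed with the standard json module rather than on Avro Schema objects, it ignores namespaces, and it assumes that a reader alias equal to the writer's name should count as a match.

```python
import json


def record_names_match(writer, reader, use_aliases=True):
    """Hypothetical helper: nominal check for two parsed record schemas.

    Assumption: names match if they are equal, or if the writer's name
    appears in the reader's "aliases" list. Namespaces are ignored.
    """
    if writer.get("type") != "record" or reader.get("type") != "record":
        return False
    if writer["name"] == reader["name"]:
        return True
    return use_aliases and writer["name"] in reader.get("aliases", [])


FIELDS = '[{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]'

# Same schemas as in the report: identical fields, different record names.
writer_schema = json.loads(
    '{"type": "record", "name": "Writer", "fields": %s}' % FIELDS)
reader_schema = json.loads(
    '{"type": "record", "name": "Reader", "fields": %s}' % FIELDS)

# A reader that explicitly declares the writer's name as an alias.
aliased_reader = json.loads(
    '{"type": "record", "name": "Reader", "aliases": ["Writer"], "fields": %s}'
    % FIELDS)

print(record_names_match(writer_schema, reader_schema))
print(record_names_match(writer_schema, aliased_reader))
```

Under these assumptions, the Writer/Reader pair from the report is rejected unless the reader lists "Writer" as an alias. The field-by-field structural comparison would be a separate step and is not shown here.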



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
