[ https://issues.apache.org/jira/browse/AVRO-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía updated AVRO-1467:
-------------------------------
    Fix Version/s:     (was: 1.10.0)

> Schema resolution does not check record names
> ---------------------------------------------
>
>                 Key: AVRO-1467
>                 URL: https://issues.apache.org/jira/browse/AVRO-1467
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.6
>            Reporter: Jim Pivarski
>            Priority: Major
>
> According to http://avro.apache.org/docs/1.7.6/spec.html#Schema+Resolution ,
> writer and reader schemas should be considered compatible if they (1) have
> the same name and (2) the reader requests a subset of the writer's fields
> with compatible types. In the Java version, I find that the structure of the
> fields is checked but the name is _not_ checked. (It's too permissive; it acts
> like a structural type check, rather than a structural and nominal one.)
> Here's a demonstration (in the Scala REPL to allow for experimentation;
> launch with "scala -cp avro-tools-1.7.6.jar" to get all the classes). The
> following writes a small, valid Avro data file:
> {code:java}
> import org.apache.avro.file.DataFileReader
> import org.apache.avro.file.DataFileWriter
> import org.apache.avro.generic.GenericData
> import org.apache.avro.generic.GenericDatumReader
> import org.apache.avro.generic.GenericDatumWriter
> import org.apache.avro.generic.GenericRecord
> import org.apache.avro.io.DatumReader
> import org.apache.avro.io.DatumWriter
> import org.apache.avro.Schema
>
> val parser = new Schema.Parser
>
> // The name is different but the fields are the same.
> val writerSchema = parser.parse("""{"type": "record", "name": "Writer", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}""")
> val readerSchema = parser.parse("""{"type": "record", "name": "Reader", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}""")
>
> def makeRecord(one: Int, two: String): GenericRecord = {
>   val out = new GenericData.Record(writerSchema)
>   out.put("one", one)
>   out.put("two", two)
>   out
> }
>
> val datumWriter = new GenericDatumWriter[GenericRecord](writerSchema)
> val dataFileWriter = new DataFileWriter[GenericRecord](datumWriter)
> dataFileWriter.create(writerSchema, new java.io.File("/tmp/test.avro"))
> dataFileWriter.append(makeRecord(1, "one"))
> dataFileWriter.append(makeRecord(2, "two"))
> dataFileWriter.append(makeRecord(3, "three"))
> dataFileWriter.close()
> {code}
> Looking at the output with "hexdump -C /tmp/test.avro", we see that the
> writer schema is embedded in the file, and the record's name is "Writer". To
> read it back:
> {code:java}
> val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
> val dataFileReader = new DataFileReader[GenericRecord](new java.io.File("/tmp/test.avro"), datumReader)
> while (dataFileReader.hasNext) {
>   val in = dataFileReader.next()
>   println(in, in.getSchema)
> }
> {code}
> The problem is that the above is successful, even though I'm requesting a
> record with name "Reader".
> If I make structurally incompatible records, for instance by writing with
> "Writer.two" being an integer and "Reader.two" being a string, it fails to
> read with org.apache.avro.AvroTypeException (as it should). If I try the
> above test with an enum type or a fixed type, it _does_ require the writer
> and reader names to match: record is the only named type for which the name
> is ignored during schema resolution.
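[Editorial illustration, not part of the original report: the resolution rule the reporter is describing can be sketched in plain Python, without the avro library. All names below (records_compatible, names_match, fields_match) are hypothetical helpers operating on schema dicts, and fields_match is deliberately simplified — real Avro also handles defaults, type promotions, and nested resolution.]

{code:python}
# Sketch of the nominal + structural check the spec implies for records.
# Assumption: schemas are plain dicts shaped like parsed Avro JSON.

def names_match(writer, reader):
    # Nominal check: same name, or the writer's name appears in the
    # reader's aliases (how the spec says renames should be declared).
    return (writer["name"] == reader["name"]
            or writer["name"] in reader.get("aliases", []))

def fields_match(writer, reader):
    # Structural check (simplified): every reader field must exist in the
    # writer with an identical type.
    writer_fields = {f["name"]: f["type"] for f in writer["fields"]}
    return all(f["name"] in writer_fields
               and writer_fields[f["name"]] == f["type"]
               for f in reader["fields"])

def records_compatible(writer, reader):
    # Per the spec, BOTH checks should pass; the reported bug is that the
    # Java implementation effectively skips names_match for records.
    return names_match(writer, reader) and fields_match(writer, reader)

writer = {"type": "record", "name": "Writer",
          "fields": [{"name": "one", "type": "int"},
                     {"name": "two", "type": "string"}]}
reader = {"type": "record", "name": "Reader",
          "fields": [{"name": "one", "type": "int"},
                     {"name": "two", "type": "string"}]}

print(records_compatible(writer, reader))  # False: names differ despite identical fields
reader_aliased = dict(reader, aliases=["Writer"])
print(records_compatible(writer, reader_aliased))  # True: alias covers the rename
{code}

Under this reading, the Writer/Reader pair above should be rejected unless the reader declares "Writer" as an alias — which is exactly what the Java implementation fails to enforce for records.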
> We're supposed to use aliases to explicitly declare which structurally
> compatible writer-reader combinations to accept. Because of the above bug,
> differently named records are accepted regardless of their aliases, but enums
> and fixed types are not accepted, even if they have the right aliases. This
> may be a separate bug, or it may be related to the above.
> To make sure that I'm correctly understanding the specification, I tried
> exactly the same thing in the Python version:
> {code:python}
> import avro.schema
> from avro.datafile import DataFileReader, DataFileWriter
> from avro.io import DatumReader, DatumWriter
>
> writerSchema = avro.schema.parse('{"type": "record", "name": "Writer", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}')
> readerSchema = avro.schema.parse('{"type": "record", "name": "Reader", "fields": [{"name": "one", "type": "int"}, {"name": "two", "type": "string"}]}')
>
> writer = DataFileWriter(open("/tmp/test2.avro", "w"), DatumWriter(), writerSchema)
> writer.append({"one": 1, "two": "one"})
> writer.append({"one": 2, "two": "two"})
> writer.append({"one": 3, "two": "three"})
> writer.close()
>
> reader = DataFileReader(open("/tmp/test2.avro"), DatumReader(None, readerSchema))
> for datum in reader:
>     print datum
> {code}
> The Python code fails on the first read with
> avro.io.SchemaResolutionException, as it is supposed to. (Interestingly,
> Python ignores the aliases as well, which I think it's not supposed to do.
> Since the Java and Python versions both have the same behavior with regard to
> aliases, I wonder if I'm understanding
> http://avro.apache.org/docs/1.7.6/spec.html#Aliases correctly.)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)