[ https://issues.apache.org/jira/browse/AVRO-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fokko Driesprong updated AVRO-1467:
-----------------------------------
Fix Version/s: (was: 1.9.0)
               1.10.0
> Schema resolution does not check record names
> ---------------------------------------------
>
> Key: AVRO-1467
> URL: https://issues.apache.org/jira/browse/AVRO-1467
> Project: Apache Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.7.6
> Reporter: Jim Pivarski
> Priority: Major
> Fix For: 1.10.0
>
>
> According to http://avro.apache.org/docs/1.7.6/spec.html#Schema+Resolution ,
> writer and reader schemas should be considered compatible if they (1) have
> the same name and (2) the reader requests a subset of the writer's fields
> with compatible types. In the Java version, I find that the structure of the
> fields is checked but the name is _not_ checked. (It's too permissive: it
> acts like a purely structural type check, rather than a structural and
> nominal one.)
> Here's a demonstration (in the Scala REPL to allow for experimentation;
> launch with "scala -cp avro-tools-1.7.6.jar" to get all the classes). The
> following writes a small, valid Avro data file:
> {code:java}
> import org.apache.avro.file.DataFileReader
> import org.apache.avro.file.DataFileWriter
> import org.apache.avro.generic.GenericData
> import org.apache.avro.generic.GenericDatumReader
> import org.apache.avro.generic.GenericDatumWriter
> import org.apache.avro.generic.GenericRecord
> import org.apache.avro.io.DatumReader
> import org.apache.avro.io.DatumWriter
> import org.apache.avro.Schema
> val parser = new Schema.Parser
> // The name is different but the fields are the same.
> val writerSchema = parser.parse("""{"type": "record", "name": "Writer",
>   "fields": [{"name": "one", "type": "int"},
>              {"name": "two", "type": "string"}]}""")
> val readerSchema = parser.parse("""{"type": "record", "name": "Reader",
>   "fields": [{"name": "one", "type": "int"},
>              {"name": "two", "type": "string"}]}""")
> def makeRecord(one: Int, two: String): GenericRecord = {
>   val out = new GenericData.Record(writerSchema)
>   out.put("one", one)
>   out.put("two", two)
>   out
> }
> val datumWriter = new GenericDatumWriter[GenericRecord](writerSchema)
> val dataFileWriter = new DataFileWriter[GenericRecord](datumWriter)
> dataFileWriter.create(writerSchema, new java.io.File("/tmp/test.avro"))
> dataFileWriter.append(makeRecord(1, "one"))
> dataFileWriter.append(makeRecord(2, "two"))
> dataFileWriter.append(makeRecord(3, "three"))
> dataFileWriter.close()
> {code}
> Looking at the output with "hexdump -C /tmp/test.avro", we see that the
> writer schema is embedded in the file, and the record's name is "Writer". To
> read it back:
> {code:java}
> val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
> val dataFileReader = new DataFileReader[GenericRecord](
>   new java.io.File("/tmp/test.avro"), datumReader)
> while (dataFileReader.hasNext) {
>   val in = dataFileReader.next()
>   println(in, in.getSchema)
> }
> {code}
> The problem is that the above read succeeds, even though I'm requesting a
> record named "Reader" from data whose embedded schema names it "Writer".
> If I make structurally incompatible records, for instance by writing with
> "Writer.two" being an integer and "Reader.two" being a string, it fails to
> read with org.apache.avro.AvroTypeException (as it should). If I try the
> above test with an enum type or a fixed type, it _does_ require the writer
> and reader names to match: record is the only named type for which the name
> is ignored during schema resolution.
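> Here's a minimal sketch of the enum variant, continuing in the same REPL
> session (the file name and symbols here are illustrative, not from the
> session above):
> {code:java}
> // Same name-mismatch test as above, but with enums instead of records.
> val writerEnum = parser.parse("""{"type": "enum", "name": "WriterEnum", "symbols": ["one", "two"]}""")
> val readerEnum = parser.parse("""{"type": "enum", "name": "ReaderEnum", "symbols": ["one", "two"]}""")
> val enumWriter = new GenericDatumWriter[AnyRef](writerEnum)
> val enumFileWriter = new DataFileWriter[AnyRef](enumWriter)
> enumFileWriter.create(writerEnum, new java.io.File("/tmp/test-enum.avro"))
> enumFileWriter.append(new GenericData.EnumSymbol(writerEnum, "one"))
> enumFileWriter.close()
> // Unlike the record case, resolving WriterEnum against ReaderEnum fails:
> // reading back throws org.apache.avro.AvroTypeException.
> val enumReader = new GenericDatumReader[AnyRef](writerEnum, readerEnum)
> val enumFileReader = new DataFileReader[AnyRef](
>   new java.io.File("/tmp/test-enum.avro"), enumReader)
> enumFileReader.next()
> {code}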
> We're supposed to use aliases to explicitly declare which structurally
> compatible writer-reader combinations to accept. Because of the above bug,
> differently named records are accepted regardless of their aliases, but enums
> and fixed types are not accepted, even if they have the right aliases. This
> may be a separate bug, or it may be related to the above.
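> For reference, this is what I understand an alias declaration to look like
> (a sketch only; it uses a fresh parser because Schema.Parser refuses to
> redefine a name it has already seen):
> {code:java}
> // Reader schema that explicitly claims compatibility with records
> // written under the name "Writer".
> val aliasedReader = (new Schema.Parser).parse("""{"type": "record",
>   "name": "Reader", "aliases": ["Writer"],
>   "fields": [{"name": "one", "type": "int"},
>              {"name": "two", "type": "string"}]}""")
> {code}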
> To make sure that I'm correctly understanding the specification, I tried
> exactly the same thing in the Python version:
> {code:python}
> import avro.schema
> from avro.datafile import DataFileReader, DataFileWriter
> from avro.io import DatumReader, DatumWriter
> writerSchema = avro.schema.parse('''{"type": "record", "name": "Writer",
>   "fields": [{"name": "one", "type": "int"},
>              {"name": "two", "type": "string"}]}''')
> readerSchema = avro.schema.parse('''{"type": "record", "name": "Reader",
>   "fields": [{"name": "one", "type": "int"},
>              {"name": "two", "type": "string"}]}''')
> writer = DataFileWriter(open("/tmp/test2.avro", "w"), DatumWriter(),
>                         writerSchema)
> writer.append({"one": 1, "two": "one"})
> writer.append({"one": 2, "two": "two"})
> writer.append({"one": 3, "two": "three"})
> writer.close()
> reader = DataFileReader(open("/tmp/test2.avro"),
>                         DatumReader(None, readerSchema))
> for datum in reader:
>     print datum
> {code}
> The Python code fails on the first read with
> avro.io.SchemaResolutionException, as it is supposed to. (Interestingly,
> Python ignores the aliases as well, which I don't think it's supposed to do.
> Since the Java and Python versions both have the same behavior with regard
> to aliases, I wonder if I'm understanding
> http://avro.apache.org/docs/1.7.6/spec.html#Aliases correctly.)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)