[ https://issues.apache.org/jira/browse/AVRO-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371389#comment-15371389 ]
Doug Cutting commented on AVRO-1704:
------------------------------------

I don't see why anyone would prefer the interface to the abstract base class. It seems like belt and suspenders (https://youtube.com/watch?v=VuWzeoIr7J4). Who do we imagine would implement this outside of the project?

Frankly, I question whether this even needs to be abstract. Applications will use this API because they want to use Avro's tagged binary encoding for messages. Applications that want an untagged binary encoding can use the existing APIs. The in-memory format is already abstracted, and the encoding is fixed. What we're providing here isn't an extensible framework; it's some utility code. Folks who seek to optimize away the 10-byte overhead can use a DatumWriter & BinaryEncoder as they do today. That's an unsafe encoding, and we needn't further simplify it. Our goal is to provide an easy-to-use, safe, standard encoding for messages.

I also worry that names like BinaryDatumDecoder are confusing when we already have BinaryDecoder and DatumReader. We might instead call a so-prefixed, binary-encoded datum a "message", and have MessageWriter and MessageReader classes that implement this, plus a MessageSchemaStore, perhaps even placing these all in a new "message" package. I won't reject this patch over these differences in style.

I prefer not to hide things behind abstractions until there's a clear need. At that point, when multiple implementations are required, one has a better idea of what the abstraction should be. In the meantime, the code is substantially smaller and easier to read, debug, maintain, etc. But this is a style issue where reasonable folks might differ.

It's hard to believe we don't already have reusable array i/o streams around! Perhaps these should go in the util package so they can be used more widely?

I like the convenience methods generated for specific data. We might also add utilities for generic & reflect, like model#getMessageWriter(Schema)?
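For context, the "10-byte overhead" Doug mentions is a two-byte marker plus an 8-byte schema fingerprint prepended to the raw binary-encoded datum. Below is a minimal, stdlib-only sketch of that framing, assuming the CRC-64-AVRO fingerprint algorithm from the Avro specification and the 0xC3 0x01 marker that AVRO-1704 ultimately standardized; the `frame` helper is illustrative, not an Avro API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class MessageFraming {
    // Two-byte marker identifying a tagged Avro message (per the spec's
    // single-object encoding); illustrative here.
    static final byte[] MARKER = {(byte) 0xC3, (byte) 0x01};

    // CRC-64-AVRO, as defined in the Avro specification.
    static final long EMPTY = 0xc15d213aa4d7a795L;
    static final long[] FP_TABLE = new long[256];
    static {
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++)
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            FP_TABLE[i] = fp;
        }
    }

    // Fingerprint of a schema's canonical form.
    static long fingerprint64(byte[] data) {
        long fp = EMPTY;
        for (byte b : data)
            fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ b) & 0xff];
        return fp;
    }

    // Prepend marker + little-endian fingerprint to an already-encoded
    // datum: the 10 bytes of "safety" overhead under discussion.
    static byte[] frame(String canonicalSchema, byte[] encodedDatum) {
        ByteBuffer buf = ByteBuffer.allocate(10 + encodedDatum.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(MARKER);
        buf.putLong(fingerprint64(
                canonicalSchema.getBytes(StandardCharsets.UTF_8)));
        buf.put(encodedDatum);
        return buf.array();
    }

    public static void main(String[] args) {
        // 0x06, 'f', 'o', 'o' is Avro binary for the string "foo"
        // (zigzag-encoded length 3, then the bytes).
        byte[] msg = frame("\"string\"", new byte[]{0x06, 'f', 'o', 'o'});
        System.out.println(msg.length); // 4-byte payload + 10-byte header
    }
}
```

A reader would check the marker, look up the fingerprint in a schema store, and only then decode the payload, which is what makes the tagged form "safe" relative to bare DatumWriter/BinaryEncoder output.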
> Standardized format for encoding messages with Avro
> ---------------------------------------------------
>
>                 Key: AVRO-1704
>                 URL: https://issues.apache.org/jira/browse/AVRO-1704
>             Project: Avro
>          Issue Type: Improvement
>            Reporter: Daniel Schierbeck
>            Assignee: Niels Basjes
>             Fix For: 1.9.0, 1.8.3
>
>         Attachments: AVRO-1704-2016-05-03-Unfinished.patch, AVRO-1704-20160410.patch
>
>
> I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:
> 1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
> 2. The entire schema is inlined every time.
> However, the Datafile format is the only one that has been standardized, meaning that I can read and write data with minimal effort across the various languages in use in my organization. If there was a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.
> I think the necessary pieces of the format would be:
> 1. A format version number.
> 2. A schema fingerprint type identifier, i.e. Rabin, MD5, SHA256, etc.
> 3. The actual schema fingerprint (according to the type.)
> 4. Optional metadata map.
> 5. The encoded datum.
> The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.
> The idea is that SchemaStore would be an abstract interface that allowed library users to inject custom backends. A simple, file system based one could be provided out of the box.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
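The SchemaStore idea in the quoted description, a pluggable lookup from schema fingerprint to the writer's schema, can be sketched as follows. All class and method names here are hypothetical illustrations of the proposal, not actual Avro APIs, and schemas are kept as plain JSON strings so the example stays self-contained (a real implementation would use org.apache.avro.Schema):

```java
import java.util.HashMap;
import java.util.Map;

// The abstract interface library users would implement to inject a
// custom backend (e.g. a schema registry service).
interface SchemaStore {
    // Return the writer's schema registered under this fingerprint.
    String findByFingerprint(long fingerprint);
}

// Minimal in-memory backend; the description suggests a simple
// file-system-based one could ship out of the box.
class InMemorySchemaStore implements SchemaStore {
    private final Map<Long, String> schemas = new HashMap<>();

    void register(long fingerprint, String schemaJson) {
        schemas.put(fingerprint, schemaJson);
    }

    @Override
    public String findByFingerprint(long fingerprint) {
        String schema = schemas.get(fingerprint);
        if (schema == null)
            throw new IllegalStateException(
                    "Unknown schema fingerprint: " + fingerprint);
        return schema;
    }
}

public class SchemaStoreDemo {
    public static void main(String[] args) {
        InMemorySchemaStore store = new InMemorySchemaStore();
        store.register(42L, "\"string\"");
        // A MessageReader would decode the fingerprint from the message
        // header and ask the store for the corresponding writer's schema.
        System.out.println(store.findByFingerprint(42L));
    }
}
```

Keying the store on the fingerprint rather than a name lets the reader resolve the exact writer's schema for schema-resolution purposes, even across schema versions.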