I don't think this is necessary. For char and varchar, the underlying storage shouldn't actually do anything differently. For example, what should Avro do if the user writes a long string to a VARCHAR(16) field? The last thing Avro should do is drop the extra bytes, so we're forced to do nothing and store the data as requested. Same thing on read: Avro should pass back whatever string was written, regardless of the logical type, and the engine should truncate.
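To make that split of responsibilities concrete, here is a rough sketch of engine-side truncation on read (the readVarcharField helper and its class are hypothetical, not Avro or Hive APIs; length counting is simplified to UTF-16 code units):

    import org.apache.avro.generic.GenericRecord;

    // Illustrative engine-side helper: Avro hands back whatever string was
    // stored, and the SQL engine enforces the declared VARCHAR(n) length.
    public class VarcharReadSketch {
      static String readVarcharField(GenericRecord record, String field, int maxLength) {
        Object value = record.get(field);
        if (value == null) {
          return null;
        }
        // Avro returns the full stored value, ignoring any logical type annotation.
        String raw = value.toString();
        // Truncating (or padding, for CHAR) is the engine's job, not the file format's.
        return raw.length() <= maxLength ? raw : raw.substring(0, maxLength);
      }
    }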
There's also no benefit to these types. UTF-8 may have multi-byte characters, so we can't use a fixed-length buffer for storage. CHAR and VARCHAR are, in my opinion, antiquated database types that don't have any value at the storage layer. I think it makes sense for Hive or Spark to allow users to get the same behavior, but that should be implemented at the database level, not at the file level.

Do you know why Hive is storing these annotations in Avro? If I remember correctly, it is to get around passing the table's types to the read path, which isn't a good reason to add this to the Avro spec, when the expected behavior is to do nothing differently (which is itself probably confusing at first glance).

rb

On Thu, Oct 19, 2017 at 3:19 AM, Zoltan Ivanfi <[email protected]> wrote:
> Hi,
>
> Apparently, when saving char or varchar columns to Avro, Hive and Spark add
> non-standard logical type annotations:
>
> {"type":"string","logicalType":"char","maxLength":42}
> {"type":"string","logicalType":"varchar","maxLength":42}
>
> Considering that probably these two SQL engines are the creators of the
> majority of all Avro files written so far, I was wondering whether we
> should make these annotations official by adding them to the specification.
>
> Any opinions?
>
> Thanks,
>
> Zoltan

-- 
Ryan Blue
Software Engineer
Netflix
