Hello all! At Spotify we have some Avro datasets stored as Parquet in our HDFS cluster. As part of our internal data processing framework built atop Apache Crunch, we generally read and transform these records using typed pipelines and SpecificRecord classes generated from a shared schema repository. The assumption is that in most cases you'd prefer working with a UserSomethingRecord with IDE code completion rather than a GenericRecord with .get calls.

There are, however, some cases where we'd like to use GenericRecords, especially in infrastructure teams migrating data formats or writing out plain JSON. In these use cases we don't care who authored what; we just want to do some recursive processing on all the fields and ignore the specific types completely. The Parquet version in question was 1.6.0rc3, which I've now bumped to 1.6.0 after seeing https://issues.apache.org/jira/browse/PARQUET-140.

My question is whether it is at all possible to coerce the Avro representation loaded from our Parquet files to use GenericRecord everywhere, even in nested fields whose schemas carry the full namespace of a generated class, and even when such a class can be located by the classloader. When using vanilla Avro files, we get this functionality from Crunch in the form of the Avros.generics ptype, which takes a schema and types the generated collection to GenericData.Record instances.

I noticed the official 1.6.0 release added the AvroDataSupplier interface. Is there a possible implementation of it that would behave as needed? I can get the Avro schema out of the Parquet file footers. My plan was to use Crunch's AvroParquetFileSource https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/parquet/AvroParquetFileSource.java and simply pass the same ptype that the vanilla Avro files use, but no matter what I do this always seems to load specifics, and my collections typed to generics fail with a ClassCastException.

Is there a way to ensure that the only instances ever created by the reader are GenericRecord instances? Please let me know if I can provide more detail.

~Mark
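
P.S. To make the use case concrete, the kind of type-agnostic recursive processing I mean looks roughly like the sketch below. It is only an illustration: a plain Map<String, Object> stands in for GenericRecord and a List for an Avro array, and the class and method names (GenericWalk, toJson, quote) are made up for this example and are not part of Avro, Parquet, or Crunch. The point is that the traversal never needs to know which generated class a nested field would map to.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only: Map<String, Object> stands in for GenericRecord.
class GenericWalk {

    // Recursively renders a value as JSON text without knowing any concrete record type.
    static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Map) {
            // Nested "record": visit every field by name, whatever its type is.
            Map<?, ?> record = (Map<?, ?>) value;
            return record.entrySet().stream()
                    .map(e -> quote(String.valueOf(e.getKey())) + ":" + toJson(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
        }
        if (value instanceof List) {
            // "Array" field: recurse into each element.
            return ((List<?>) value).stream()
                    .map(GenericWalk::toJson)
                    .collect(Collectors.joining(",", "[", "]"));
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        // Everything else is written as a string.
        return quote(value.toString());
    }

    // Minimal escaping (backslash and double quote only); enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, Object> artist = new LinkedHashMap<>();
        artist.put("name", "example");

        Map<String, Object> track = new LinkedHashMap<>();
        track.put("id", 1);
        track.put("artist", artist);               // nested record
        track.put("tags", Arrays.asList("a", "b")); // array field

        System.out.println(toJson(track));
    }
}
```

With real data the Map branch would be a GenericRecord branch driven by the record's schema, which is why a SpecificRecord instance showing up in a nested field is a problem for us.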
