Hello all!

At Spotify we have some Avro datasets stored as Parquet in our HDFS
cluster.  As part of our internal data processing framework built atop
Apache Crunch, we generally read and transform these records using typed
pipelines and SpecificRecord classes generated from a shared schema
repository.  The assumption is that in most cases you'd prefer working with
a UserSomethingRecord with IDE code completion rather than a GenericRecord
with .get calls.

There are however some instances where we'd like to use GenericRecords,
especially in infrastructure teams migrating data formats or writing out
plain JSON.  These are use cases where we don't care who authored what; we
probably just want to do some recursive processing on all the fields and
ignore the specific types completely.
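
To make that use case concrete, here is a minimal, dependency-free sketch
of the kind of type-agnostic recursion I mean.  It is only an illustration:
java.util.Map and java.util.List are used as stand-ins for nested records
and arrays, and GenericJsonSketch, toJson and quote are hypothetical names,
not part of Avro, Parquet or Crunch.  With real GenericRecords the same
shape would presumably iterate over record.getSchema().getFields() instead
of map entries.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an Avro, Parquet, or Crunch class.
final class GenericJsonSketch {

    // Recursively renders a value as JSON text without knowing any concrete record type.
    static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Map) {
            // Stand-in for a nested record: visit every field in iteration order.
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                if (!first) {
                    sb.append(",");
                }
                first = false;
                sb.append(quote(String.valueOf(entry.getKey())))
                  .append(":")
                  .append(toJson(entry.getValue()));
            }
            return sb.append("}").toString();
        }
        if (value instanceof List) {
            // Stand-in for an array field: recurse into each element.
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (Object element : (List<?>) value) {
                if (!first) {
                    sb.append(",");
                }
                first = false;
                sb.append(toJson(element));
            }
            return sb.append("]").toString();
        }
        if (value instanceof Number || value instanceof Boolean) {
            // Note: NaN and infinities are not handled specially in this sketch.
            return value.toString();
        }
        // Everything else is written as a string.
        return quote(value.toString());
    }

    // Escapes only backslashes and double quotes; control characters are left as-is.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Stockholm");

        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", "mark");
        user.put("address", address);
        user.put("tags", Arrays.asList("a", "b"));

        System.out.println(toJson(user));
    }
}
```

The point is that nothing in the traversal depends on which generated class
a nested field would normally map to, which is why GenericRecords are the
natural fit here.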

The Parquet version in question was 1.6.0rc3, which I've now bumped to
1.6.0 after seeing https://issues.apache.org/jira/browse/PARQUET-140.

My question is whether it is at all possible to coerce the Avro
representation loaded from our Parquet files to use GenericRecord
everywhere, even in nested fields whose fully qualified names match
generated schema classes, and even when such a class can be located by the
classloader.  When using vanilla Avro files, we get this functionality from
Crunch in the form of the Avros.generics ptype, which takes a schema and
types the resulting collection to GenericData.Record instances.

I noticed the 1.6.0 official release added the AvroDataSupplier
interface... is there a possible implementation of this that could behave
as needed?

I can get the Avro schema out of the Parquet file footers.  My assumption
was that I could use Crunch's AvroParquetFileSource
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/parquet/AvroParquetFileSource.java
and simply pass the same ptype that the vanilla Avro files use, but no
matter what, this always seems to load specifics, and my collections typed
to generics fail with a ClassCastException.

Is there a way to ensure that the only instances ever created by the
reader are GenericRecord instances?  Please let me know if I can provide
more detail.

~Mark
