[ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200332#comment-14200332 ]
Gabriel Reid commented on CRUNCH-480:
-------------------------------------

It looks like the situation in [AvroReadSupport in Parquet|https://github.com/apache/incubator-parquet-mr/blob/0148455170be07f89bd6b9230960a6cd510c7ca6/parquet-avro/src/main/java/parquet/avro/AvroReadSupport.java#L54-L87] has been cleaned up quite a bit in the meantime.

If upgrading to Parquet 1.4.x is an option, a short-term workaround that I tried out and that seems to work is as follows: you pass the read schema to the constructor of AvroParquetFileSource, and then you make this call:

{code}
AvroReadSupport.setAvroReadSchema(pipeline.getConfiguration(), readSchema);
{code}

Unfortunately, that sets the read schema globally for the pipeline, so it will be a problem if you're reading multiple Parquet sources within the one pipeline.

As far as the structural fix goes, I think the following should do it:
* upgrade to Parquet 1.4.x or later
* do the equivalent of {{AvroReadSupport.setAvroReadSchema}} in {{AvroParquetFileSource#getBundle}} based on the schema that is passed in to the AvroParquetFileSource constructor (if there is one)

Does that sound right to you [~tomwhite]? Or are there other nuances to projection and/or read schemas in Parquet that I'm missing?

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration
> param required to use a user-supplied read schema that differs from the
> schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I found
> this:
> {code}
> this.recordConverter = readSupport.prepareForRead(
>     configuration, extraMetadata, fileSchema,
>     new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore
> the supplied requestedSchema and, instead, looks for the key avro.read.schema
> in the readSupportMetadata map. This is seriously kooky code in Parquet
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can
> never properly supply a read schema. Boooo hisssss.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)