[ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200632#comment-14200632 ]
Gabriel Reid commented on CRUNCH-480:
-------------------------------------

And now having thought about this a bit more, I see that I was over-simplifying things a bit with my proposed fix of just doing the equivalent of {{AvroReadSupport.setAvroReadSchema}} when a custom schema is provided, as this means that a projection schema always means that a custom read schema is used, and vice versa.

I guess the situations that need to be supported are:
* no projection and use the write schema for reading
* use projection, but use the write schema for reading (which means some fields will just be null)
* use projection and a custom read schema

I'm not clear if a custom read schema without a projection is something that would be needed. [~esammer], could you elaborate on your use case? I'm guessing that using a projection

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration
> param required to use a user-supplied read schema that differs from the
> schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found
> this:
> {code}
> this.recordConverter = readSupport.prepareForRead(
>     configuration, extraMetadata, fileSchema,
>     new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore
> the supplied requestedSchema and, instead, looks for the key avro.read.schema
> in the readSupportMetadata map. This is seriously kookie code in Parquet
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can
> never properly supply a read schema.
> Boooo hisssss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)