[ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Wills updated CRUNCH-480:
------------------------------
    Attachment: CRUNCH-480.patch

It's not clear to me that a new constructor is needed vs. using the existing
constructors as intended, viz., the AvroType indicates the Schema that should
be read (and is configured as such), and the additional projected schema (if
given) is used to project a subset of the columns.

[~tomwhite], how does this patch look to you?

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.patch
>
>
> It seems like AvroParquetFileSource doesn't properly set the configuration
> param required to use a user-supplied read schema that differs from the
> schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found
> this:
> {code}
> this.recordConverter = readSupport.prepareForRead(
>     configuration, extraMetadata, fileSchema,
>     new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore
> the supplied requestedSchema and, instead, looks for the key avro.read.schema
> in the readSupportMetadata map. This is seriously kookie code in Parquet
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can
> never properly supply a read schema. Boooo hisssss.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
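
For readers following the issue description, here is a minimal Java sketch of the lookup it attributes to AvroReadSupport#prepareForRead(). It is a toy model, not Parquet or Crunch code: the class ReadSchemaLookupSketch and the method resolveReadSchema are hypothetical, schemas are stood in for by plain strings, and the fallback to the file schema when the key is absent is an assumption that the description does not confirm. Only the key name avro.read.schema and the "requestedSchema is ignored" behavior come from the report.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are Parquet or Crunch classes.
class ReadSchemaLookupSketch {

    // Key named in the issue description.
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    // Models the reported behavior: requestedSchema is accepted but never
    // consulted; only the readSupportMetadata map is checked.
    static String resolveReadSchema(String fileSchema,
                                    String requestedSchema,
                                    Map<String, String> readSupportMetadata) {
        String fromMetadata = (readSupportMetadata == null)
                ? null
                : readSupportMetadata.get(READ_SCHEMA_KEY);
        // Assumption: with no metadata entry, reading falls back to the
        // schema stored in the file.
        return (fromMetadata != null) ? fromMetadata : fileSchema;
    }

    public static void main(String[] args) {
        // Caller passes a requested schema but no metadata entry.
        Map<String, String> noMetadata = new HashMap<String, String>();
        System.out.println(
                resolveReadSchema("fileSchema", "userSchema", noMetadata));

        // Caller also places the schema under the expected key.
        Map<String, String> withKey = new HashMap<String, String>();
        withKey.put(READ_SCHEMA_KEY, "userSchema");
        System.out.println(
                resolveReadSchema("fileSchema", "userSchema", withKey));
    }
}
```

Under these assumptions, the first call never sees the user-supplied schema, which matches the symptom in the report; the second call does, because the schema was also placed in the metadata map under the expected key.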