[ 
https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202113#comment-14202113
 ] 

Tom White commented on CRUNCH-480:
----------------------------------

We hashed out the differences between projection and read schemas on 
https://github.com/Parquet/parquet-mr/pull/246, and came to the conclusion that 
they are orthogonal. Read schemas are for schema evolution in the usual Avro 
fashion, whereas projection schemas are just a convenient way to select a 
subset of the columns that you want to read.
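
To make the distinction concrete, here is a minimal sketch in plain Java. It 
does not use the Avro or Parquet APIs: records are modelled as maps, and the 
names SchemaSketch, project and resolve are made up for illustration. 
Projection only drops columns, while read-schema resolution is a heavily 
simplified stand-in for Avro's rules (fields the reader doesn't know are 
dropped, fields missing from the data take the reader's default).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Avro, Parquet or Crunch.
public final class SchemaSketch {

    // Projection: keep only the requested columns that exist in the stored record.
    static Map<String, Object> project(Map<String, Object> stored, List<String> columns) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : columns) {
            if (stored.containsKey(column)) {
                out.put(column, stored.get(column));
            }
        }
        return out;
    }

    // Read-schema resolution (simplified): the reader's fields drive the result.
    // A field present in the data keeps its stored value; a field missing from
    // the data takes the default declared by the reader.
    static Map<String, Object> resolve(Map<String, Object> stored,
                                       Map<String, Object> readerFieldDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFieldDefaults.entrySet()) {
            String name = field.getKey();
            out.put(name, stored.containsKey(name) ? stored.get(name) : field.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 1L);
        stored.put("name", "alice");
        stored.put("email", "alice@example.com");

        // Projection: read just two of the three columns.
        Map<String, Object> projected = project(stored, Arrays.asList("id", "name"));

        // Read schema: a newer reader that has added a "country" field with a default.
        Map<String, Object> reader = new LinkedHashMap<>();
        reader.put("id", 0L);
        reader.put("name", "");
        reader.put("country", "unknown");

        // The two steps are independent, so they compose.
        System.out.println(resolve(projected, reader));
    }
}
```

Because the two operations don't depend on each other, a caller can use either 
one alone or both together, which is the sense in which they are orthogonal.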

I think the change needed in AvroParquetFileSource is a new constructor that 
takes both a projection schema and a read schema. The existing constructors 
that take a schema both treat it as a projection schema, and that meaning 
can't be changed (e.g. to a read schema) for compatibility reasons.
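
Roughly the shape I have in mind, as a standalone sketch rather than a patch: 
the class name ParquetSourceSketch, the String-typed schemas and the 
"sketch.*" configuration keys are placeholders I've made up here, not the 
real Crunch or Parquet names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the source class; schemas are passed as JSON strings.
public final class ParquetSourceSketch {
    private final String path;
    private final String projectionSchema; // null means "read all columns"
    private final String readSchema;       // null means "use the schema in the file"

    // Existing-style constructor: the single schema keeps its current
    // meaning (projection), so existing callers see no behaviour change.
    public ParquetSourceSketch(String path, String projectionSchema) {
        this(path, projectionSchema, null);
    }

    // New constructor: projection and read schema are supplied separately.
    public ParquetSourceSketch(String path, String projectionSchema, String readSchema) {
        this.path = path;
        this.projectionSchema = projectionSchema;
        this.readSchema = readSchema;
    }

    // Each schema is written under its own key, and only if it was supplied.
    // The key names are placeholders, not real Parquet properties.
    public Map<String, String> toConfiguration() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("sketch.path", path);
        if (projectionSchema != null) {
            conf.put("sketch.projection.schema", projectionSchema);
        }
        if (readSchema != null) {
            conf.put("sketch.read.schema", readSchema);
        }
        return conf;
    }
}
```

Delegating the old constructor to the new one with a null read schema keeps 
the compatibility guarantee in one place.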

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration 
> param required to use a user-supplied read schema that differs from the 
> schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I 
> found this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore 
> the supplied requestedSchema and, instead, looks for the key avro.read.schema 
> in the readSupportMetadata map. This is seriously kooky code in Parquet 
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can 
> never properly supply a read schema. Boooo hisssss.


