[ 
https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208517#comment-14208517
 ] 

Gabriel Reid commented on CRUNCH-480:
-------------------------------------

[~jwills] that looks good to me. I think that the constructor issue is now a 
non-issue since the last patch that I posted, as the projection schema is now 
only set if it has been explicitly set in the builder. I believe the situation 
is now the following:
* the avro "writer" (i.e. file) schema is taken from the parquet file
* the avro "reader" schema is taken from the PType or supplied schema in the 
builder
* the parquet projection is by default null (which means that it is the same as 
the writer schema), but can be supplied by the builder or AvroParquetFileSource 
constructor
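
As a rough illustration of how those three roles interact, here is a toy model 
using plain Java maps. It is only a sketch of the resolution rules as I 
understand them, not the Avro, Parquet, or Crunch APIs; the class name 
SchemaResolutionSketch, the read method, and the field names are all 
hypothetical. The file record stands in for the writer schema, the reader map 
(field name to default value) stands in for the reader schema, and a null 
projection means "read every column in the file".

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SchemaResolutionSketch {

    /**
     * Toy resolution of one record.
     *
     * @param fileRecord     field values as stored in the file (writer schema)
     * @param projection     columns to read from the file, or null for all of them
     * @param readerDefaults reader schema, modelled as field name -> default value
     */
    static Map<String, Object> read(Map<String, Object> fileRecord,
                                    Set<String> projection,
                                    Map<String, Object> readerDefaults) {
        // A null projection is treated as "same as the writer schema".
        Set<String> columns = (projection == null) ? fileRecord.keySet() : projection;

        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            if (columns.contains(name) && fileRecord.containsKey(name)) {
                // The column was read from the file: use the stored value.
                result.put(name, fileRecord.get(name));
            } else {
                // Not projected, or absent from the file: use the reader default.
                result.put(name, field.getValue());
            }
        }
        // Fields that exist only in the file are dropped, since the reader
        // schema does not ask for them.
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> fileRecord = new LinkedHashMap<>();
        fileRecord.put("id", 1);
        fileRecord.put("name", "a");

        Map<String, Object> reader = new LinkedHashMap<>();
        reader.put("id", 0);
        reader.put("name", "n/a");
        reader.put("country", "unknown"); // not present in the file

        // No projection: "country" is filled in from the reader default.
        System.out.println(read(fileRecord, null, reader));
        // prints {id=1, name=a, country=unknown}

        // Projection of "id" only: "name" also falls back to its default.
        System.out.println(read(fileRecord, Collections.singleton("id"), reader));
        // prints {id=1, name=n/a, country=unknown}
    }
}
```

The first call corresponds to the case discussed in this issue: a reader schema 
that differs from the file schema, with no projection supplied.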

The issue that I was referring to previously, where the defaults would not get 
filled in if you supplied a reader schema that was different than the file 
schema but didn't supply a projection schema, is no longer an issue, and there 
is a test in the last patch(es) that demonstrates this. I think this is ready 
to go as-is.

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.1.patch, CRUNCH-480.2.patch, 
> CRUNCH-480.3.patch, CRUNCH-480.patch
>
>
> It seems like AvroParquetFileSource doesn't properly set the configuration 
> param required to use a user-supplied read schema that differs from the 
> schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found 
> this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore 
> the supplied requestedSchema and, instead, looks for the key avro.read.schema 
> in the readSupportMetadata map. This is seriously kooky code in Parquet 
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can 
> never properly supply a read schema. Boooo hisssss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
