[ 
https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200332#comment-14200332
 ] 

Gabriel Reid commented on CRUNCH-480:
-------------------------------------

It looks like the situation in [AvroReadSupport in 
Parquet|https://github.com/apache/incubator-parquet-mr/blob/0148455170be07f89bd6b9230960a6cd510c7ca6/parquet-avro/src/main/java/parquet/avro/AvroReadSupport.java#L54-L87]
 has been cleaned up quite a bit in the meantime.

If upgrading to Parquet 1.4.x is an option, a short-term workaround that I 
tried out and that seems to work is as follows: you pass the read schema to the 
constructor of AvroParquetFileSource, and then you make this call:
{code}
AvroReadSupport.setAvroReadSchema(
        pipeline.getConfiguration(),
        readSchema);
{code}

Unfortunately, that sets that read schema globally for the pipeline, so if 
you're reading multiple Parquet sources within the one pipeline that'll be a 
problem.
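
To make the limitation concrete, here is a minimal sketch of why a pipeline-wide setting breaks down with two sources. It is only a model: a plain {{Map}} stands in for the Hadoop Configuration, the class and method names are hypothetical, the key name is the one mentioned in the issue description, and the schema strings are placeholders rather than real Avro schemas.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in model only: a plain Map plays the role of the pipeline-wide configuration.
class GlobalReadSchemaDemo {
    // Key name as mentioned in the issue description; treated here as illustrative.
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    // Mimics a setter that writes the read schema into the shared configuration.
    static void setReadSchema(Map<String, String> conf, String schemaJson) {
        conf.put(READ_SCHEMA_KEY, schemaJson);
    }

    // Returns the single schema that every source sharing this configuration would see.
    static String effectiveReadSchema(Map<String, String> conf) {
        return conf.get(READ_SCHEMA_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> pipelineConf = new HashMap<>();
        setReadSchema(pipelineConf, "{\"name\":\"UserV1\"}");   // intended for source A
        setReadSchema(pipelineConf, "{\"name\":\"EventV2\"}");  // intended for source B
        // Only the last value survives, so source A would be read with B's schema.
        System.out.println(effectiveReadSchema(pipelineConf));
    }
}
```

Because there is one key and one shared map, the second call silently replaces the first, which is the behaviour to expect from any setting stored once per pipeline.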

As for the structural fix, I think the following should do it:
* upgrade to Parquet 1.4.x or later
* do the equivalent of {{AvroReadSupport.setAvroReadSchema}} in 
{{AvroParquetFileSource#getBundle}} based on the schema that is passed in to 
the AvroParquetFileSource constructor (if there is one)
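
The idea in the second bullet can be sketched roughly as follows. This is a stdlib-only model under stated assumptions, not the actual Crunch or Parquet code: {{SourceBundle}} and {{bundleFor}} are hypothetical stand-ins for whatever {{getBundle}} returns, a {{Map}} stands in for the job configuration, and the key name is the one from the issue description.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class PerSourceReadSchemaDemo {
    // Key name as mentioned in the issue description; treated here as illustrative.
    static final String READ_SCHEMA_KEY = "avro.read.schema";

    // Hypothetical stand-in for a per-source bundle of extra configuration entries.
    static final class SourceBundle {
        private final Map<String, String> extraConf = new LinkedHashMap<>();

        SourceBundle set(String key, String value) {
            extraConf.put(key, value);
            return this;
        }

        // Applies this bundle's entries to a copy of the base configuration,
        // so the shared base is never modified.
        Map<String, String> configure(Map<String, String> base) {
            Map<String, String> local = new HashMap<>(base);
            local.putAll(extraConf);
            return local;
        }
    }

    // Builds the bundle for one source; the read schema is set only if one was supplied.
    static SourceBundle bundleFor(String readSchemaJson) {
        SourceBundle bundle = new SourceBundle();
        if (readSchemaJson != null) {
            bundle.set(READ_SCHEMA_KEY, readSchemaJson);
        }
        return bundle;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        Map<String, String> confA = bundleFor("{\"name\":\"UserV1\"}").configure(base);
        Map<String, String> confB = bundleFor("{\"name\":\"EventV2\"}").configure(base);
        // Each source sees its own read schema; the shared base stays empty.
        System.out.println(confA.get(READ_SCHEMA_KEY));
        System.out.println(confB.get(READ_SCHEMA_KEY));
        System.out.println(base.isEmpty());
    }
}
```

The point of the design is that the read schema travels with the source rather than with the pipeline, so two Parquet sources with different read schemas no longer overwrite each other.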

Does that sound right to you [~tomwhite]? Or are there other nuances to 
projection and/or read schemas in Parquet that I'm missing?

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Priority: Blocker
>
> It seems like AvroParquetFileSource doesn't properly set the configuration 
> param required to use a user-supplied read schema that differs from the 
> schema in the file.
> Deep in the guts of Parquet (InternalParquetRecordReader#initialize()), I found 
> this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore 
> the supplied requestedSchema and, instead, looks for the key avro.read.schema 
> in the readSupportMetadata map. This is seriously kooky code in Parquet 
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can 
> never properly supply a read schema. Boooo hisssss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
