[ https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206836#comment-14206836 ]

Gabriel Reid commented on CRUNCH-480:
-------------------------------------

[~tomwhite] - I wouldn't really call it a bug in the default handling in 
Parquet; it's more a consequence of misusing the API. The situation is as 
follows: if you have Parquet files written with Schema v1, then create Schema 
v2 that adds a new field with a default value, and try to read the original 
files using Schema v2, the defaults won't get filled in during reading.
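
To make the scenario concrete, here is a minimal sketch of the two schemas 
using Avro's SchemaBuilder (the record and field names are hypothetical):

{code}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Schema v1: the schema the existing Parquet files were written with.
Schema v1 = SchemaBuilder.record("User").fields()
    .requiredString("name")
    .endRecord();

// Schema v2: adds a new field with a default value. Reading a v1 file with
// this schema should fill in "unknown" for the missing field, but in the
// situation described above no default is applied.
Schema v2 = SchemaBuilder.record("User").fields()
    .requiredString("name")
    .name("email").type().stringType().stringDefault("unknown")
    .endRecord();
{code}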

This happens because the reader schema (Schema v2) is also used as the 
Parquet projection schema. In the constructor of 
parquet.avro.AvroIndexedRecordConverter, the default value handling is based on 
the difference between the projection schema and the reader schema, and because 
in this case they are the same schema, no default value handling is done 
at all.
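
For illustration only, this is roughly how the two schemas can end up 
identical on the read path, assuming the parquet-avro static helpers on 
AvroReadSupport and the Schema v2 from the sketch above:

{code}
import org.apache.hadoop.conf.Configuration;
import parquet.avro.AvroReadSupport;

Configuration conf = new Configuration();

// If the reader schema is also registered as the Parquet projection,
// AvroIndexedRecordConverter sees no difference between the projection
// schema and the reader schema, so it never fills in default values.
AvroReadSupport.setRequestedProjection(conf, v2);
AvroReadSupport.setAvroReadSchema(conf, v2);
{code}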

...and while writing this up and going back over your comments, I saw your 
question about why getBundle sets a projection schema even if one isn't 
supplied; it seems that removing that may fix everything (or almost 
everything). In any case, the situation I just described is fixed by only 
setting the projection schema conditionally. I'll upload the patch that does 
that in just a moment.
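
Sketched out (not the actual patch), that conditional behaviour would look 
something like the following, again assuming the AvroReadSupport helpers and 
with a hypothetical method and variable names:

{code}
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import parquet.avro.AvroReadSupport;

// Hypothetical helper: only register a Parquet projection when the caller
// explicitly supplied one; always register the Avro read schema so the
// converter can apply defaults for fields missing from older files.
static void configureReadSchemas(Configuration conf, Schema readSchema,
    Schema projectionSchema) {
  if (projectionSchema != null) {
    AvroReadSupport.setRequestedProjection(conf, projectionSchema);
  }
  AvroReadSupport.setAvroReadSchema(conf, readSchema);
}
{code}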

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.1.patch, CRUNCH-480.patch
>
>
> It seems like AvroParquetFileSource doesn't properly set the configuration 
> param required to use a user-supplied read schema that differs from the 
> schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found 
> this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore 
> the supplied requestedSchema and, instead, looks for the key avro.read.schema 
> in the readSupportMetadata map. This is seriously kookie code in Parquet 
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can 
> never properly supply a read schema. Boooo hisssss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
