[jira] [Updated] (CRUNCH-480) AvroParquetFileSource doesn't properly configure user-supplied read schema

Gabriel Reid (JIRA) Tue, 11 Nov 2014 05:48:15 -0800

     [ 
https://issues.apache.org/jira/browse/CRUNCH-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gabriel Reid updated CRUNCH-480:
--------------------------------
    Attachment: CRUNCH-480.1.patch

I think you've got me convinced [~jwills]. I was actually finally taking a 
closer look at this too, and had put together some integration tests which I've 
added to your changes in the attached patch.

I think it's still probably necessary to do something with the 
AvroParquetFileSource.Builder class to make setting a reader schema clearer. 
As it currently stands, doing something like this:
{code}
AvroParquetFileSource.builder(readSchemaWithSupersetOfFields).build()
{code}
will create an AvroParquetFileSource instance that uses the same schema for the 
Parquet projection and for Avro reading. This seems to work OK, except that 
default handling doesn't work completely correctly within Parquet when you do 
this, and seeing as default handling is a basic requirement for using a custom 
read schema, that's an issue. If you instead give the builder a projection 
schema whose fields are a subset (proper or not) of the writer schema's fields, 
then everything seems to work fine.
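
To make the two roles concrete, here is a rough sketch in plain Java. It is not 
Crunch or Parquet code: maps stand in for schemas and records, and every name 
in it (SchemaRolesSketch, project, resolve, demo) is made up for illustration. 
The projection only selects columns that exist in the file, while the reader 
schema is what supplies defaults for fields the file doesn't have:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: plain maps stand in for Avro schemas and records. */
final class SchemaRolesSketch {

    /** Projection step: keep only the stored columns that the projection names. */
    static Map<String, Object> project(Map<String, Object> storedRecord,
                                       List<String> projectionFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : projectionFields) {
            if (storedRecord.containsKey(field)) {
                out.put(field, storedRecord.get(field));
            }
        }
        return out;
    }

    /** Reader-schema step: every reader field is present, using its default if missing. */
    static Map<String, Object> resolve(Map<String, Object> projected,
                                       Map<String, Object> readerFieldDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFieldDefaults.entrySet()) {
            Object value = projected.containsKey(field.getKey())
                    ? projected.get(field.getKey())
                    : field.getValue();
            out.put(field.getKey(), value);
        }
        return out;
    }

    /** Hypothetical data: the file has id and name; the reader schema adds country. */
    static Map<String, Object> demo() {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 1);
        stored.put("name", "alice");

        // Reader schema: field name -> default value.
        Map<String, Object> readerDefaults = new LinkedHashMap<>();
        readerDefaults.put("id", 0);
        readerDefaults.put("name", "");
        readerDefaults.put("country", "unknown");

        // The projection is restricted to fields the writer actually wrote.
        List<String> projection = Arrays.asList("id", "name");

        return resolve(project(stored, projection), readerDefaults);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {id=1, name=alice, country=unknown}
    }
}
```

In this toy model the default for country only appears because the reader 
schema is applied as a separate step after the projection.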

Maybe supplying a custom read schema with Parquet is enough of a non-default 
option that it can just be made clear that you need to use the constructor (and 
not the builder) if you want to supply a custom reader schema. I'm not sure, 
but it seems difficult to fit in the ability to specify a different reader 
schema with the builder as-is without making its API overly complicated.

> AvroParquetFileSource doesn't properly configure user-supplied read schema
> --------------------------------------------------------------------------
>
>                 Key: CRUNCH-480
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-480
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.10.0
>            Reporter: E. Sammer
>            Assignee: Gabriel Reid
>            Priority: Blocker
>         Attachments: CRUNCH-480.1.patch, CRUNCH-480.patch
>
>
> It seems like AvroParquetFileSource doesn't properly set the configuration 
> param required to use a user-supplied read schema that differs from the 
> schema in the file.
> Deep in the guts of Parquet (InternalParquetReader#initialize()), I found 
> this:
> {code}
>    this.recordConverter = readSupport.prepareForRead(
>         configuration, extraMetadata, fileSchema,
>         new ReadSupport.ReadContext(requestedSchema, readSupportMetadata));
> {code}
> Later, in Parquet's AvroReadSupport#prepareForRead(), it appears to ignore 
> the supplied requestedSchema and, instead, looks for the key avro.read.schema 
> in the readSupportMetadata map. This is seriously kookie code in Parquet 
> (i.e. wrong), but because Crunch doesn't supply readSupportMetadata, we can 
> never properly supply a read schema. Boooo hisssss.
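
The lookup described in the report can be sketched roughly as follows. This is 
a toy model, not the real Parquet classes: strings stand in for schemas, the 
class and method names (ReadSchemaLookupSketch, chooseReadSchema) are 
hypothetical, and falling back to the file schema when the key is absent is an 
assumption of the sketch. Only the avro.read.schema key comes from the report.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the lookup described in the report; not the real Parquet classes. */
final class ReadSchemaLookupSketch {

    // Metadata key named in the report.
    static final String AVRO_READ_SCHEMA_KEY = "avro.read.schema";

    /**
     * Picks the schema used to materialize records. requestedSchema is
     * deliberately unused, mirroring the behavior the report describes.
     * Falling back to fileSchema is an assumption of this sketch.
     */
    static String chooseReadSchema(String fileSchema, String requestedSchema,
                                   Map<String, String> readSupportMetadata) {
        if (readSupportMetadata != null) {
            String fromMetadata = readSupportMetadata.get(AVRO_READ_SCHEMA_KEY);
            if (fromMetadata != null) {
                return fromMetadata;
            }
        }
        return fileSchema;
    }

    public static void main(String[] args) {
        // No metadata supplied: the requested schema is silently dropped.
        System.out.println(chooseReadSchema("fileSchema", "userSchema", null));

        // Metadata carries the key: the user's schema is picked up.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(AVRO_READ_SCHEMA_KEY, "userSchema");
        System.out.println(chooseReadSchema("fileSchema", "userSchema", metadata));
    }
}
```

Under these assumptions, a caller that passes only requestedSchema always gets 
the file schema back, which matches the symptom in the report.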



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
