On 12/04/2014 10:28 AM, Yan Qi wrote:
Hi rb,

Thanks for your quick reply!

I first set the read schema,
AvroParquetInputFormat.setAvroReadSchema(job, Profile.getClassSchema());

Then I define a request schema which is a subset of
Profile.getClassSchema() and set the projection:
AvroParquetInputFormat.setRequestedProjection(job, requestSchema);

Is there any problem with this? Or is there anything else I missed?

Thanks,
Yan

How many fields are there in the iR record in your read schema? I think the problem is that you're getting defaults for the columns you're removing with the projected schema. So if you have 13 columns in the read schema but you're only loading 3 of them from the file, then you're defaulting 10 columns and that might cause a slow-down.

An easy way to check is to see how many columns in iR your read schema has and how many you're actually loading from the file. Then try filtering your read schema as well so you don't have as many and see if that helps performance.
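
Here's a rough sketch of that check, using plain field-name lists so it runs without any Avro or Parquet dependencies. The class name ProjectionCheck, the method defaultedFields, and the field names are all made up for illustration; with real schemas you could build the two lists from Schema.getFields() and Field.name() on your read schema and your requested projection.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of parquet-avro.
class ProjectionCheck {
    // Returns the read-schema fields that are missing from the projection.
    // These are the fields that would be filled with defaults instead of
    // being loaded from the file.
    static List<String> defaultedFields(List<String> readFields,
                                        List<String> projectedFields) {
        Set<String> projected = new LinkedHashSet<>(projectedFields);
        List<String> defaulted = new ArrayList<>();
        for (String name : readFields) {
            if (!projected.contains(name)) {
                defaulted.add(name);
            }
        }
        return defaulted;
    }

    public static void main(String[] args) {
        // Made-up field names standing in for the real schemas.
        List<String> read = List.of("id", "name", "email", "age", "city");
        List<String> projected = List.of("id", "name");

        List<String> defaulted = defaultedFields(read, projected);
        System.out.println(read.size() + " fields in read schema, "
                + projected.size() + " loaded, "
                + defaulted.size() + " defaulted: " + defaulted);
    }
}
```

If the defaulted list is long, try passing a read schema that contains only the projected fields and compare the run time.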

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.
