Re: Avoiding serialization/de-serialization in pig

Thejas Nair Wed, 30 Jun 2010 09:44:22 -0700

On 6/28/10 5:51 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

> 
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
> 
> Thoughts on just using Avro for the internal PigStorage?

When I profiled Pig queries, I didn't see much time being spent in
DataType.findType(Object o), where the type of the object is determined using
"instanceof". (I am assuming you were referring to that.)

But we can still optimize the cases where the schema is known (i.e. all rows have
the same schema) by not storing the type with each field in the serialization
format. Avro stores the schema separately, so I assume it has this
optimization. But in the case where the schema is not known, we would need to
store the type information for every row.
When the query plan is generated, we would need to determine which serialization
format is to be used.
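
As a rough illustration of the trade-off, here is a minimal sketch in Python.
It is not Pig's or Avro's actual wire format: the one-byte type tags, the
function names and the field encodings are all made up for the example. One
writer prefixes every field with a type tag (schema unknown); the other relies
on a schema supplied out of band and writes only the values (schema known).

```python
import struct

# Hypothetical one-byte type tags; these are NOT Pig's real DataType codes.
TAG_INT, TAG_DOUBLE, TAG_STR = 1, 2, 3


def _tag_of(value):
    """Find the tag by inspecting the value (the 'instanceof' style lookup)."""
    if isinstance(value, int):
        return TAG_INT
    if isinstance(value, float):
        return TAG_DOUBLE
    if isinstance(value, str):
        return TAG_STR
    raise TypeError(f"unsupported value: {value!r}")


def _encode_value(tag, value):
    if tag == TAG_INT:
        return struct.pack(">q", value)            # 8-byte big-endian integer
    if tag == TAG_DOUBLE:
        return struct.pack(">d", value)            # 8-byte IEEE double
    if tag == TAG_STR:
        raw = value.encode("utf-8")
        return struct.pack(">I", len(raw)) + raw   # 4-byte length, then bytes
    raise ValueError(f"unsupported tag: {tag}")


def _decode_value(tag, buf, pos):
    """Decode one value at pos; return (value, position after the value)."""
    if tag == TAG_INT:
        return struct.unpack_from(">q", buf, pos)[0], pos + 8
    if tag == TAG_DOUBLE:
        return struct.unpack_from(">d", buf, pos)[0], pos + 8
    if tag == TAG_STR:
        (length,) = struct.unpack_from(">I", buf, pos)
        start = pos + 4
        return buf[start:start + length].decode("utf-8"), start + length
    raise ValueError(f"unsupported tag: {tag}")


def write_tagged(row):
    """Schema unknown: every field carries its own type tag."""
    out = bytearray()
    for value in row:
        tag = _tag_of(value)
        out.append(tag)
        out += _encode_value(tag, value)
    return bytes(out)


def read_tagged(buf):
    row, pos = [], 0
    while pos < len(buf):
        value, pos = _decode_value(buf[pos], buf, pos + 1)
        row.append(value)
    return row


def write_with_schema(row, schema):
    """Schema known: the tags live in the schema, not in each row."""
    return b"".join(_encode_value(tag, value) for tag, value in zip(schema, row))


def read_with_schema(buf, schema):
    row, pos = [], 0
    for tag in schema:
        value, pos = _decode_value(tag, buf, pos)
        row.append(value)
    return row


row = [42, 3.5, "pig"]
schema = [TAG_INT, TAG_DOUBLE, TAG_STR]
tagged = write_tagged(row)
plain = write_with_schema(row, schema)
```

In this sketch the schema-based encoding saves one byte per field per row and
skips the per-value type lookup on write. The reader has to be handed the same
schema, which is why the choice of format would have to be made at plan time.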

-Thejas
