For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer-based loaders.
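
A minimal sketch of why this helps on wide tables, assuming (2) amounts to decoding only the columns a script actually projects. This is not Pig's API or the protocol buffer loader itself: the tab-delimited record format, the per-column converter schema, and the names `decode_all` and `decode_projected` are hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical schema: one converter per column, in column order.
Schema = Sequence[Callable[[str], object]]


def decode_all(line: str, schema: Schema) -> List[object]:
    """Eagerly convert every column of a tab-delimited record."""
    fields = line.split("\t")
    return [convert(raw) for convert, raw in zip(schema, fields)]


def decode_projected(
    line: str, schema: Schema, wanted: Sequence[int]
) -> Dict[int, object]:
    """Convert only the requested column indexes.

    Columns that are not projected are split out of the line but never
    passed to their converter, so their deserialization cost is skipped.
    """
    fields = line.split("\t")
    return {i: schema[i](fields[i]) for i in wanted}


if __name__ == "__main__":
    schema = [int, str, float, int]
    line = "7\talice\t3.5\t42"
    print(decode_all(line, schema))
    print(decode_projected(line, schema, [0, 2]))
```

With hundreds of columns and only a handful projected, the second function does a small fraction of the conversion work per record, which is consistent with the kind of gain described above.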
I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting on every field, would also be a big win.

Thoughts on just using Avro for the internal PigStorage?

-D

On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
> I have created a wiki which puts together some ideas that can help in
> improving performance by avoiding/delaying serialization/de-serialization.
>
> http://wiki.apache.org/pig/AvoidingSedes
>
> These are ideas that don't involve changes to optimizer. Most of them
> involve changes in the load/store functions.
>
> Your feedback is welcome.
>
> Thanks,
> Thejas