Agreed. I have compared the performance of Hive and Pig using some simple scripts, and Hive performed much better than Pig. The mapper task times of Pig and Hive are almost the same; the difference comes almost entirely from the reduce tasks, where much of the time is spent transferring data from the mappers to the reducers. This is because Pig transfers much more data than Hive does. Hive uses a different binary format (Hive-640 <https://issues.apache.org/jira/browse/HIVE-640>) that reduces the intermediate data between mapper and reducer. Avro is very similar to this, and it is more compact. I believe it would improve Pig's performance.
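
To illustrate why a schema-driven binary format can shrink the mapper-to-reducer data, here is a rough sketch in Python. It is not Pig's, Hive's, or Avro's actual wire format: the record, the schema tuple, and all function names are made up for illustration. It compares an encoding that tags every field with its type and uses fixed-width integers against one that relies on a schema known to both sides and uses variable-length (zigzag) integers.

```python
import struct

# Hypothetical record and schema, used only for this illustration.
RECORD = (42, 3.5, "alice")
SCHEMA = ("long", "double", "string")


def encode_self_describing(record):
    """Tag every field with a one-byte type marker and use fixed-width ints."""
    out = bytearray()
    for value in record:
        if isinstance(value, int):
            out += b"I" + struct.pack(">q", value)
        elif isinstance(value, float):
            out += b"D" + struct.pack(">d", value)
        elif isinstance(value, str):
            data = value.encode("utf-8")
            out += b"S" + struct.pack(">i", len(data)) + data
        else:
            raise TypeError("unsupported field type: %r" % type(value))
    return bytes(out)


def _write_varint(n, out):
    """Append a 64-bit signed int as a zigzag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return


def _read_varint(buf, pos):
    """Read a zigzag varint from buf at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos


def encode_with_schema(record, schema):
    """Write values only; the reader is assumed to already know the schema."""
    out = bytearray()
    for value, kind in zip(record, schema):
        if kind == "long":
            _write_varint(value, out)
        elif kind == "double":
            out += struct.pack("<d", value)
        elif kind == "string":
            data = value.encode("utf-8")
            _write_varint(len(data), out)
            out += data
        else:
            raise ValueError("unknown schema type: %s" % kind)
    return bytes(out)


def decode_with_schema(buf, schema):
    """Inverse of encode_with_schema, driven entirely by the schema."""
    pos, fields = 0, []
    for kind in schema:
        if kind == "long":
            value, pos = _read_varint(buf, pos)
        elif kind == "double":
            (value,) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif kind == "string":
            length, pos = _read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unknown schema type: %s" % kind)
        fields.append(value)
    return tuple(fields)


if __name__ == "__main__":
    print(len(encode_self_describing(RECORD)))      # 28 bytes
    print(len(encode_with_schema(RECORD, SCHEMA)))  # 15 bytes
```

For this one small record the schema-driven form is roughly half the size. How much a real job would save depends on the data and on the actual formats involved, so treat the numbers as an illustration only.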

On Tue, Jun 29, 2010 at 8:51 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> For what it's worth, I saw very significant speed improvements (order of
> magnitude for wide tables with few projected columns) when I implemented (2)
> for our protocol buffer - based loaders.
>
> I have a feeling that propagating schemas when known, and using them to for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?
>
> -D
>
> On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
>> I have created a wiki which puts together some ideas that can help in
>> improving performance by avoiding/delaying serialization/de-serialization .
>>
>> http://wiki.apache.org/pig/AvoidingSedes
>>
>> These are ideas that don't involve changes to optimizer. Most of them
>> involve changes in the load/store functions.
>>
>> Your feedback is welcome.
>>
>> Thanks,
>> Thejas
>>
>>
>

--
Best Regards

Jeff Zhang