Re: Avoiding serialization/de-serialization in pig

Jeff Zhang Wed, 30 Jun 2010 00:27:02 -0700

Agreed. I have compared the performance of Hive and Pig using some simple
scripts, and Hive performs much better than Pig. The mapper task time of
Pig and Hive is almost the same; the difference comes almost entirely from
the reduce tasks, where much of the time is spent transferring data from
mapper to reducer. This is because Pig transfers much more data than Hive.
Hive uses a different binary format (HIVE-640,
https://issues.apache.org/jira/browse/HIVE-640), which reduces the
intermediate data between mapper and reducer. Avro is very similar to this
and is more compact, so I believe it would improve Pig's performance.
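
To illustrate why a schema-driven binary format can shrink the data moved
between mapper and reducer, here is a rough sketch in Python. It is not
Pig's, Hive's, or Avro's actual wire format; the schema, field names, and
one-byte type tags are all made up for the example. It only compares a
self-describing encoding (a type tag on every value) with one where both
sides already know the schema.

```python
import struct

# Hypothetical schema for illustration only: (field name, struct format code).
# 'q' = 8-byte integer, 'i' = 4-byte integer, 'd' = 8-byte double.
SCHEMA = [("user_id", "q"), ("clicks", "i"), ("score", "d")]


def encode_tagged(row):
    """Self-describing encoding: every value carries a one-byte type tag,
    and every integer is widened to 8 bytes because no schema is available."""
    out = bytearray()
    for value in row:
        if isinstance(value, int):
            out += b"L" + struct.pack(">q", value)
        elif isinstance(value, float):
            out += b"D" + struct.pack(">d", value)
        else:
            raise TypeError("unsupported type: %r" % type(value))
    return bytes(out)


def encode_with_schema(row, schema=SCHEMA):
    """Schema-driven encoding: no per-value tags, field widths come from the schema."""
    fmt = ">" + "".join(code for _, code in schema)
    return struct.pack(fmt, *row)


def decode_with_schema(data, schema=SCHEMA):
    """Decode using the same schema that was used to encode."""
    fmt = ">" + "".join(code for _, code in schema)
    return struct.unpack(fmt, data)


row = (12345, 7, 0.5)
tagged = encode_tagged(row)
compact = encode_with_schema(row)
print(len(tagged), len(compact))
```

The gap grows with the number of fields per tuple, since the tagged form
pays the per-value overhead on every field of every record.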




On Tue, Jun 29, 2010 at 8:51 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> For what it's worth, I saw very significant speed improvements (order of
> magnitude for wide tables with few projected columns) when I implemented
> (2) for our protocol buffer-based loaders.
>
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?
>
> -D
>
> On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
>> I have created a wiki which puts together some ideas that can help in
>> improving performance by avoiding/delaying serialization/de-serialization.
>>
>> http://wiki.apache.org/pig/AvoidingSedes
>>
>> These are ideas that don't involve changes to optimizer. Most of them
>> involve changes in the load/store functions.
>>
>> Your feedback is welcome.
>>
>> Thanks,
>> Thejas
>>
>>
>



-- 
Best Regards

Jeff Zhang
