Thanks Thejas for the comments! See my answers inline.

> 1. Order-by
> The comparison against hive order-by is misleading. Hive does not do total
> ordering, unless you use a single reducer.
> But yes, in case of pig, the sampling phase is unnecessary, if you use a
> single reducer. A single reducer can make sense if the data you are sorting
> is small. I agree that it makes sense to remove the sampling phase in pig in
> such cases.

Yes, the environment setup uses only 1GB of data, so there is only 1 reducer
for the order-by. I've also updated the doc to note that Hive always uses 1
reducer for the order-by. I'll also make sure Pig and Hive use the same number
of maps/reduces where possible and update the doc.

> 2. Lazy type conversion
> Can you add a note about how many records are there in input vs output ?
> In this example, we can improve by using the logical optimizer, so only
> necessary parts are typecast before the filter.
>

I've purposely filtered out all the input records. From the logical plan, the
filter is not pushed above the foreach, which may be a separate issue that
needs investigating. Therefore, each record is fully deserialized and then
thrown away.

> One problem in pig is that it uses java objects like Integer, String etc
> which are final types. Which means that we can't create a subclass that
> delays the conversion until it actually gets used. The types are part of
> the udf interface. We should consider if we want to do something like this,
> when we add new udf interfaces.
>
> Some thoughts on serialization/deserialization improvements that I had
> written earlier - http://wiki.apache.org/pig/AvoidingSedes
>

Thanks for sharing these thoughts! I'll incorporate them into the doc and
discuss more details later.

Jie

> Thanks,
> Thejas
>
> On 6/21/12 11:14 AM, Jie Li wrote:
>>
>> Hello everyone,
>>
>> I compiled a list of possible optimizations for Pig's performance.
>>
>> https://cwiki.apache.org/confluence/display/PIG/Pig+Performance+Optimization
>>
>> As I haven't been very familiar with the codebase, I'm likely to
>> underestimate the complexity involved, so any input will be
>> appreciated.
>>
>> Thanks,
>> Jie
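
P.S. To make the lazy type conversion point above concrete, here is a minimal sketch in plain Java. It is not Pig code: LazyInt and countMatching are hypothetical names, and the comma-separated record layout is an assumption made only for illustration. Because Integer is final, the sketch uses a wrapper that keeps the raw bytes and parses them on first access, so a filter that reads only one field never pays to convert the others.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; none of these names exist in Pig.
class LazyConversionSketch {

    /** Hypothetical wrapper: keeps a field's raw bytes and parses them on first use. */
    static final class LazyInt {
        static int conversions = 0; // demo counter: how many fields were actually parsed

        private final byte[] raw;
        private Integer value;      // cached result, null until first access

        LazyInt(byte[] raw) {
            this.raw = raw;
        }

        int get() {
            if (value == null) {
                value = Integer.parseInt(new String(raw, StandardCharsets.UTF_8).trim());
                conversions++;
            }
            return value;
        }
    }

    /** Counts records whose first field exceeds the threshold; other fields stay unparsed. */
    static int countMatching(String[] lines, int threshold) {
        int matches = 0;
        for (String line : lines) {
            String[] parts = line.split(",");
            LazyInt[] fields = new LazyInt[parts.length];
            for (int i = 0; i < parts.length; i++) {
                fields[i] = new LazyInt(parts[i].getBytes(StandardCharsets.UTF_8));
            }
            // The filter reads only field 0, so only field 0 is converted.
            if (fields[0].get() > threshold) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] lines = {"1,10,100", "2,20,200", "3,30,300"};
        int matches = countMatching(lines, 5); // the filter rejects every record
        System.out.println("matches=" + matches + " conversions=" + LazyInt.conversions);
    }
}
```

In this toy run every record is filtered out, as in the benchmark query, and only one field per record is converted instead of all three. Doing the same inside Pig would run into the UDF interface issue Thejas describes, since UDFs receive Integer, String, etc. directly.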
