Jian,

If what you are looking for is something that lets you deal with skewed data without having to think about how the underlying distributed system works, both Pig and Hive will help you do that to some extent.

If you are looking for something that gives you fine-grained control over the scheduling of individual tasks, which is what this sounds like, neither project is for you. In fact, that is more or less the opposite of what they are trying to do, which is to take away the complexities of partitioning large data sets, scheduling tasks, and orchestrating data flows.

If you want to tweak the Hadoop internals to schedule tasks differently, you may find the pluggable scheduler interface useful. If you manage to achieve your goals by writing a new scheduler, Pig and Hive will both continue working as higher-level abstractions, as long as you adhere to the provided interface for task scheduling.

On Mon, Feb 8, 2010 at 2:05 AM, jian yi <[email protected]> wrote:

> We can regards a task as a sleep call, the parameter of sleep is the time
> long.
> sleep(N) - For hive ,the N is not certain
> sleep(M) - For MBR, the M is certain
>
> 2010/2/8 jian yi <[email protected]>
>
>> Hi Jeff,
>>
>> Thank you Jeff.
>> I known Hive has handling skewed join, but I think it is not enough:
>> 1.Need cost sample
>> 2.Can't control the size of a task
>> 3.Not exact
>> 4.Must use Hive or Pig
>>
>> I think this is a fundamental solution for skew problem by adding balacne
>> between map and reduce. Maybe I need express it more detailed.
>>
>> Regards
>> Jian YI
>>
>> 2010/2/8 Jeff Hammerbacher <[email protected]>
>>
>> Hey Jian,
>>>
>>> Hive supports arbitrary procedural languages through Hadoop Streaming; see
>>> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform for more.
>>>
>>> Also, both Hive and Pig have support for handling skewed joins if you use
>>> their higher-level interface. See
>>> https://issues.apache.org/jira/browse/HIVE-562 and
>>> http://wiki.apache.org/pig/PigSkewedJoinSpec.
>>>
>>> Thanks,
>>> Jeff
>>>
>>> On Sun, Feb 7, 2010 at 4:13 AM, jian yi <[email protected]> wrote:
>>>
>>> > Hey Jeff,
>>> >
>>> > Thank you, Jeff.
>>> > The procedure means procedure language, like Oracle PL/SQL, which is very
>>> > helpful to migrate old services. We want to build a data warehouse based on
>>> > MapReduce engine. I plan to optimize MapReduce to solve the skew problem by
>>> > adding a balance between map and reduce. Please refer to
>>> > http://bbs.hadoopor.com/thread-521-1-1.html
>>> >
>>> > Regards,
>>> > Jian
>>> >
>>> > 2010/2/7 Jeff Hammerbacher <[email protected]>
>>> >
>>> > > Hey Jian,
>>> > >
>>> > > I'm not sure what you mean by "Hive don't support procedure", but in any
>>> > > case, the Pig team has stated that they will support SQL over the Pig
>>> > > execution engine. See https://issues.apache.org/jira/browse/PIG-824.
>>> > >
>>> > > Regards,
>>> > > Jeff
>>> > >
>>> > > On Sat, Feb 6, 2010 at 6:16 PM, jian yi <[email protected]> wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > SQL is very helpful to develop data warehouse, but Hive don't support
>>> > > > procedure. if Pig support SQL, it will be more powerful.
