Yeah, we already have some results, but they are not very good, so we are currently rewriting some of the scripts, especially the joins. Once we get a good result we will publish it.
Jie

On Tue, Nov 29, 2011 at 2:41 PM, Jonathan Coveney <[email protected]> wrote:

> I'm a little confused. Do you already have the benchmarks? I'd love to see
> them if you do. Do you want to make a JIRA in order to put this info on the
> site? I'm a little confused, but I agree that statistics can help focus
> effort and could also be a good tool for evangelism (especially if Pig is
> in fact as fast as Hive in some cases).
>
> 2011/11/29 Jie Li <[email protected]>
>
> > Hello everyone,
> >
> > As people are usually most concerned about performance, we need more
> > benchmarks to identify the bottlenecks in Pig's performance. For a class
> > project we developed a whole set of Pig scripts for TPC-H. Though Pig was
> > not designed for this RDBMS benchmark, it does support most of the
> > relational operators, like join and aggregation, which can be optimized
> > based on this benchmark. Besides that, we can also demonstrate how to
> > write efficient Pig scripts by making full use of Pig Latin's features.
> >
> > Here is what we did:
> > 1) write correct Pig scripts for TPC-H, verifying them on 1GB of data.
> > 2) demonstrate the flexibility of Pig Latin by using the COGROUP operator
> > to implement join.
> > 3) show how to optimize the join by slightly reordering it or by using a
> > replicated join. We think Pig should be able to apply more heuristic
> > optimizations to joins, such as evaluating the smaller join first, using
> > a replicated join for small tables, and putting the larger table on the
> > right side of the hash join.
> > 4) identify the poor performance of aggregation. Pig doesn't yet support
> > hash-based aggregation, so aggregation is extremely slow. Good to know
> > that Pig is just about to support it :)
> >
> > As TPC-H is well known, a good benchmark result can help change people's
> > impression that Pig is slow. We actually compared Pig and Hive and found
> > that Pig is not necessarily slower than Hive. I wonder if we can create a
> > JIRA for this project.
> >
> > Thanks,
> > Jie Li
> > PhD Candidate of Computer Science
> > Duke University
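
For readers unfamiliar with the two join strategies mentioned in points 2) and 3) of the quoted message, here is a minimal plain-Python sketch of the ideas. It is not Pig code and not how Pig implements them: the toy `customers` and `orders` rows and both function names are hypothetical. The first function joins by grouping both inputs on the key and crossing the resulting bags (the COGROUP-style approach); the second keeps the small input in an in-memory hash table and streams the large one past it (the idea behind a replicated join).

```python
from collections import defaultdict

# Hypothetical toy rows: customers are (custkey, name), orders are (orderkey, custkey).
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(100, 1), (101, 1), (102, 3), (103, 4)]


def cogroup_join(left, right, left_key, right_key):
    """Inner join done COGROUP-style: bucket both inputs by key, then cross the bags."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    out = []
    for key in sorted(groups):
        left_bag, right_bag = groups[key]
        # An empty bag on either side produces no rows, which gives inner-join semantics.
        out.extend(l + r for l in left_bag for r in right_bag)
    return out


def replicated_join(big, small, big_key, small_key):
    """Inner join that holds `small` in an in-memory hash table and streams `big` past it."""
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)
    return [b + s for b in big for s in table.get(b[big_key], [])]


cogrouped = cogroup_join(customers, orders, 0, 1)
replicated = replicated_join(orders, customers, 1, 0)
```

Both calls match the same customer/order pairs; only the column order differs, because each function puts its first argument's columns first. The second version never groups or sorts the large input, which is why it only pays off when the small table fits in memory.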
