Yeah, sure. We are just about to post them.

Jie

On Tue, Nov 29, 2011 at 8:18 PM, Jonathan Coveney <[email protected]> wrote:

> If you want some feedback on how to make the scripts faster, feel free
> to post them.
>
> 2011/11/29 Jie Li <[email protected]>
>
> > Did you mean the two update functions of TPC-H? I think we can leave
> > them out as Hive did, since Hadoop is usually not used for updates.
> >
> > Jie
> >
> > On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <[email protected]> wrote:
> >
> > > Please do. The association with TPC-H might be tricky, as it mandates
> > > concurrent data modification. Nevertheless, the benchmark will be
> > > very useful, as you point out.
> > >
> > > -----Original Message-----
> > > From: Jie Li [mailto:[email protected]]
> > > Sent: Tuesday, November 29, 2011 11:38 AM
> > > To: [email protected]
> > > Subject: Running TPC-H on Pig
> > >
> > > Hello everyone,
> > >
> > > As people are usually most concerned about performance, we need more
> > > benchmarks to identify the bottlenecks in Pig's performance. For a
> > > class project we developed a whole set of Pig scripts for TPC-H.
> > > Though Pig was not designed for this RDBMS benchmark, it does support
> > > most of the relational operators, like join and aggregation, which
> > > can be optimized based on this benchmark. Besides that, we can also
> > > demonstrate how to write efficient Pig scripts by making full use of
> > > Pig Latin's features.
> > >
> > > Here is what we did:
> > > 1) Wrote correct Pig scripts for TPC-H and verified them on 1GB of
> > > data.
> > > 2) Demonstrated the flexibility of Pig Latin by using the COGROUP
> > > operator to implement join.
> > > 3) Showed how to optimize the join by slightly reordering it or by
> > > using replicated join. We think Pig should be able to apply more
> > > heuristic optimizations to joins, such as evaluating the smaller join
> > > first, using replicated join for small tables, and putting the larger
> > > table on the right side of the hash join.
> > > 4) Identified the poor performance of aggregation. Pig doesn't yet
> > > support hash-based aggregation, so aggregation is extremely slow.
> > > Good to know that Pig is just about to support it :)
> > >
> > > As TPC-H is well known, a good benchmark result can help change
> > > people's impression that Pig is slow. We actually compared Pig and
> > > Hive and found that Pig is not necessarily slower than Hive. I wonder
> > > if we can create a JIRA for this project.
> > >
> > > Thanks,
> > > Jie Li
> > > PhD Candidate of Computer Science
> > > Duke University
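
For readers less familiar with the two join strategies named in items 2 and 3 of the original message, here is a minimal Python sketch of the ideas. It is not Pig Latin and not the scripts discussed in this thread; the function names (`cogroup`, `join_via_cogroup`, `replicated_join`) and the tuple-based row layout are assumptions made purely for illustration. The first pair of functions shows how an inner join can be built from a cogroup-style grouping step; the last one shows the idea behind a replicated (map-side) join, where the small table is held in memory and the large table is streamed past it.

```python
from collections import defaultdict


def cogroup(left, right, key_left, key_right):
    """Group rows of both inputs by key, keeping one bag of rows per input.

    Rows are plain tuples; key_left and key_right are column indices.
    Returns {key: (left_rows, right_rows)}.
    """
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key_left]][0].append(row)
    for row in right:
        groups[row[key_right]][1].append(row)
    return dict(groups)


def join_via_cogroup(left, right, key_left, key_right):
    """Inner join expressed as: cogroup, then cross the two bags per key."""
    joined = []
    for left_rows, right_rows in cogroup(left, right, key_left, key_right).values():
        # A key with an empty bag on either side produces no output rows,
        # which is exactly inner-join semantics.
        for l_row in left_rows:
            for r_row in right_rows:
                joined.append(l_row + r_row)
    return joined


def replicated_join(big, small, key_big, key_small):
    """Inner join that builds an in-memory hash table from the small input
    and streams the big input past it (the idea behind a replicated join)."""
    lookup = defaultdict(list)
    for row in small:
        lookup[row[key_small]].append(row)
    return [
        b_row + s_row
        for b_row in big
        for s_row in lookup.get(b_row[key_big], [])
    ]
```

Both functions return the same set of joined rows for the same inputs; the difference is that the cogroup version groups both inputs by key first, while the replicated version only ever materializes the small input, which is why it pays off when one table fits in memory.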
