My bad, I was talking about TPC-DS (: I used TPC-DS to test Pig joins, but I didn't think of comparing it with Hive because Hive already has ongoing projects for its cost-based optimizer, and I thought it wouldn't be a fair comparison. But I guess your work is related to the Starfish system, right? Anyway, I hope to see your benchmark.
Renato M.

2011/12/2 Jie Li <[email protected]>:
> TPC-E is for transactions, so why is it better for evaluating Hadoop-related
> systems?
>
> We are benchmarking the whole queries. We found that some simple heuristics
> work very well so far. No doubt that the statistics would help make an even
> better query plan.
>
> Jie
>
> On Wed, Nov 30, 2011 at 12:18 AM, Renato Marroquín Mogrovejo
> <[email protected]> wrote:
>
>> Hey,
>>
>> Why didn't you use TPC-E? And what exactly are you guys benchmarking,
>> i.e. specific components of both systems or the whole queries?
>> Because Hive is already able to use some basic statistics but Pig isn't,
>> and at least until hcat is ready it won't be able to take full advantage
>> of them.
>>
>> Renato M.
>>
>> On Nov 29, 2011 8:18 PM, "Jonathan Coveney" <[email protected]> wrote:
>>
>> > If you want some feedback on how to make the scripts faster, feel free
>> > to post them.
>> >
>> > 2011/11/29 Jie Li <[email protected]>
>> >
>> > > Did you mean the two update functions of TPC-H? I think we can leave
>> > > them out as Hive did, as Hadoop is usually not for updates.
>> > >
>> > > Jie
>> > >
>> > > On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan
>> > > <[email protected]> wrote:
>> > >
>> > > > Please do. The association with TPC-H might be tricky as it mandates
>> > > > concurrent data modification. Nevertheless, the benchmark will be
>> > > > very useful, as you point out.
>> > > >
>> > > > -----Original Message-----
>> > > > From: Jie Li [mailto:[email protected]]
>> > > > Sent: Tuesday, November 29, 2011 11:38 AM
>> > > > To: [email protected]
>> > > > Subject: Running TPC-H on Pig
>> > > >
>> > > > Hello everyone,
>> > > >
>> > > > As people are usually more concerned about performance, we need more
>> > > > benchmarks to identify the bottlenecks in Pig's performance. For a
>> > > > class project we developed a whole set of Pig scripts for TPC-H.
>> > > > Though Pig was not designed for this RDBMS benchmark, it does support
>> > > > most of the relational operators like join and aggregation, which can
>> > > > be optimized based on this benchmark. Besides that, we can also
>> > > > demonstrate how to write efficient Pig scripts by making full use of
>> > > > Pig Latin's features.
>> > > >
>> > > > Here is what we did:
>> > > > 1) write correct Pig scripts for TPC-H by verifying them on 1GB data.
>> > > > 2) demonstrate the flexibility of Pig Latin by using the COGROUP
>> > > >    operator to implement join.
>> > > > 3) show how to optimize the join by slightly reordering or using
>> > > >    replicated join. We think Pig should be able to have more heuristic
>> > > >    optimization for the join, such as evaluating the smaller join
>> > > >    first, using replicated join for small tables, and putting the
>> > > >    larger table on the right side of the hash join.
>> > > > 4) identify the poor performance of aggregation. Pig doesn't yet
>> > > >    support hash-based aggregation so it's extremely slow for
>> > > >    aggregation. Good to know that Pig is just about to support it :)
>> > > >
>> > > > As TPC-H is well-known, a good benchmark result can help change
>> > > > people's impression that Pig is slow. Actually we compare Pig and
>> > > > Hive and find that Pig is not necessarily slower than Hive. I wonder
>> > > > if we can create a jira for this project.
>> > > >
>> > > > Thanks,
>> > > > Jie Li
>> > > > PhD Candidate of Computer Science
>> > > > Duke University
