Re: Running TPC-H on Pig

Jonathan Coveney Tue, 29 Nov 2011 17:18:58 -0800

If you want some feedback on how to make the scripts faster, feel free
to post them.


2011/11/29 Jie Li <[email protected]>

> Did you mean the two update functions of TPC-H? I think we can leave them
> out, as Hive did, since Hadoop is not usually used for updates.
>
> Jie
>
> On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <[email protected]> wrote:
>
> > Please do. The association with TPC-H might be tricky, as it mandates
> > concurrent data modification. Nevertheless, the benchmark will be very
> > useful, as you point out.
> >
> > -----Original Message-----
> > From: Jie Li [mailto:[email protected]]
> > Sent: Tuesday, November 29, 2011 11:38 AM
> > To: [email protected]
> > Subject: Running TPC-H on Pig
> >
> > Hello everyone,
> >
> > As people are usually concerned about performance, we need more
> > benchmarks to identify the bottlenecks in Pig's performance. For a class
> > project we developed a complete set of Pig scripts for TPC-H. Though Pig
> > was not designed for this RDBMS benchmark, it does support most of the
> > relational operators, such as join and aggregation, which can be
> > optimized based on this benchmark. Besides that, we can also demonstrate
> > how to write efficient Pig scripts by making full use of Pig Latin's
> > features.
> >
> > Here is what we did:
> > 1) wrote correct Pig scripts for TPC-H, verifying them on 1GB of data.
> > 2) demonstrated the flexibility of Pig Latin by using the COGROUP
> > operator to implement join.
> > 3) showed how to optimize the join by slightly reordering it or by using
> > a replicated join. We think Pig should be able to apply more heuristic
> > optimizations for joins, such as evaluating the smaller join first,
> > using a replicated join for small tables, and putting the larger table
> > on the right side of the hash join.
> > 4) identified the poor performance of aggregation. Pig doesn't yet
> > support hash-based aggregation, so aggregation is extremely slow. Good
> > to know that Pig is just about to support it :)
> >
> > As TPC-H is well known, a good benchmark result can help change people's
> > impression that Pig is slow. In fact, we compared Pig and Hive and found
> > that Pig is not necessarily slower than Hive. I wonder if we can create
> > a JIRA for this project.
> >
> > Thanks,
> > Jie Li
> > PhD Candidate of Computer Science
> > Duke University
> >
> >
>
