Re: Running TPC-H on Pig

Renato Marroquín Mogrovejo Fri, 02 Dec 2011 14:00:10 -0800

My bad, I was talking about TPC-DS (:
I used TPC-DS to test Pig joins, but I didn't actually think of
comparing it with Hive because Hive already has ongoing projects for
its cost-based optimizer, and I thought it wouldn't be a fair
comparison. But I guess your work is related to the Starfish system,
right?
Anyway, I hope to see your benchmark.


Renato M.


2011/12/2 Jie Li <[email protected]>:
> TPC-E is for transaction processing, so why is it better for evaluating
> Hadoop-related systems?
>
> We are benchmarking whole queries. We found that some simple heuristics
> work very well so far. No doubt statistics would help make an even
> better query plan.
>
> Jie
>
> On Wed, Nov 30, 2011 at 12:18 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> Hey,
>>
>> Why didn't you use TPC-E? And what exactly are you guys
>> benchmarking, i.e. specific components of both systems or whole queries?
>> Hive is already able to use some basic statistics but Pig isn't, and
>> at least until HCat is ready it won't be able to take full advantage of
>> them.
>>
>> Renato M.
>> On Nov 29, 2011 8:18 PM, "Jonathan Coveney" <[email protected]> wrote:
>>
>> > If you want some feedback on how to make the scripts faster, feel free
>> > to post them.
>> >
>> > 2011/11/29 Jie Li <[email protected]>
>> >
>> > > Did you mean the two update functions of TPC-H? I think we can leave them
>> > > out as Hive did, since Hadoop is usually not used for updates.
>> > >
>> > > Jie
>> > >
>> > > On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <[email protected]> wrote:
>> > >
>> > > > Please do. The association with TPC-H might be tricky, as TPC-H mandates
>> > > > concurrent data modification. Nevertheless, the benchmark will be very
>> > > > useful, as you point out.
>> > > >
>> > > > -----Original Message-----
>> > > > From: Jie Li [mailto:[email protected]]
>> > > > Sent: Tuesday, November 29, 2011 11:38 AM
>> > > > To: [email protected]
>> > > > Subject: Running TPC-H on Pig
>> > > >
>> > > > Hello everyone,
>> > > >
>> > > > As people are usually more concerned about performance, we need more
>> > > > benchmarks to identify the bottlenecks in Pig's performance. For a class
>> > > > project we developed a whole set of Pig scripts for TPC-H. Though Pig was not
>> > > > designed for this RDBMS benchmark, it does support most of the relational
>> > > > operators, like join and aggregation, which can be optimized based on this
>> > > > benchmark. Besides that, we can also demonstrate how to write efficient Pig
>> > > > scripts by making full use of Pig Latin's features.
>> > > >
>> > > > Here is what we did:
>> > > > 1) Wrote correct Pig scripts for TPC-H and verified them on 1GB of data.
>> > > > 2) Demonstrated the flexibility of Pig Latin by using the COGROUP operator to
>> > > > implement join.
>> > > > 3) Showed how to optimize the join by slightly reordering it or using
>> > > > replicated join. We think Pig should have more heuristic
>> > > > optimizations for joins, such as evaluating the smaller join first, using
>> > > > replicated join for small tables, and putting the larger table on the right
>> > > > side of the hash join.
>> > > > 4) Identified the poor performance of aggregation. Pig doesn't yet support
>> > > > hash-based aggregation, so aggregation is extremely slow. Good to know
>> > > > that Pig is just about to support it :)
>> > > >
>> > > > As TPC-H is well known, a good benchmark result can help change people's
>> > > > impression that Pig is slow. In fact, we compared Pig and Hive and found that
>> > > > Pig is not necessarily slower than Hive. I wonder if we can create a JIRA
>> > > > for this project.
>> > > >
>> > > > Thanks,
>> > > > Jie Li
>> > > > PhD Candidate of Computer Science
>> > > > Duke University
>> > > >
>> > > >
>> > >
>> >
>>
>
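
Point 2 of the original message mentions implementing join with COGROUP. As a rough illustration of that idea (not the scripts from the thread, which are not shown here), the Python sketch below groups two inputs by key and then crosses the two bags in each group, dropping groups where either bag is empty. The helper names `cogroup` and `join_via_cogroup` and the sample rows are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key_left, key_right):
    """Group two lists of dict rows by key: {key: (left_bag, right_bag)}."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key_left]][0].append(row)
    for row in right:
        groups[row[key_right]][1].append(row)
    return dict(groups)


def join_via_cogroup(left, right, key_left, key_right):
    """Inner join built on cogroup: cross the two bags of every group.

    A group with an empty bag on either side contributes no rows,
    which is what makes this an inner join.
    """
    joined = []
    for l_bag, r_bag in cogroup(left, right, key_left, key_right).values():
        for l_row, r_row in product(l_bag, r_bag):
            joined.append({**l_row, **r_row})
    return joined


# Made-up sample rows (hypothetical data, only for illustration).
customers = [
    {"c_custkey": 1, "c_name": "alice"},
    {"c_custkey": 2, "c_name": "bob"},
]
orders = [
    {"o_orderkey": 10, "o_custkey": 1},
    {"o_orderkey": 11, "o_custkey": 1},
    {"o_orderkey": 12, "o_custkey": 3},
]

result = join_via_cogroup(customers, orders, "c_custkey", "o_custkey")
for row in result:
    print(row)
```

Only customer 1 has matching orders, so the join yields one row per order of that customer; customer 2 and order 12 fall into groups with an empty bag and are dropped.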

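Point 3 suggests using replicated join for small tables. A minimal sketch of that heuristic, assuming the small input fits in memory: build an in-memory lookup from the small table, then stream the large table past it. The function names, the `memory_budget` parameter, and the sample rows are hypothetical and are not taken from Pig or from the thread; this does not show how Pig implements the join internally.

```python
from collections import defaultdict


def replicated_join(large_rows, small_rows, key_large, key_small):
    """Join by holding the small table in memory and streaming the large one."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key_small]].append(row)

    joined = []
    for big in large_rows:
        # .get() avoids creating empty entries for keys with no match.
        for small in lookup.get(big[key_large], []):
            joined.append({**big, **small})
    return joined


def choose_join(left_size, right_size, memory_budget):
    """Toy heuristic: pick the replicated join when the smaller input fits in memory."""
    if min(left_size, right_size) <= memory_budget:
        return "replicated"
    return "hash"


# Made-up sample rows (hypothetical data, only for illustration).
nations = [
    {"n_nationkey": 0, "n_name": "ALGERIA"},
    {"n_nationkey": 1, "n_name": "ARGENTINA"},
]
suppliers = [
    {"s_suppkey": 100, "s_nationkey": 1},
    {"s_suppkey": 101, "s_nationkey": 0},
    {"s_suppkey": 102, "s_nationkey": 7},
]

strategy = choose_join(len(suppliers), len(nations), memory_budget=1000)
rows = replicated_join(suppliers, nations, "s_nationkey", "n_nationkey")
print(strategy, rows)
```

In Pig Latin itself the corresponding request is written with `USING 'replicated'` on the JOIN, with the small relation listed after the large one.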