I'm a little confused. Do you already have the benchmark results? I'd love to
see them if you do. Do you want to open a JIRA so this information can go on
the site? Either way, I agree that statistics can help focus effort and could
also be a good tool for evangelism (especially if Pig is in fact as fast as
Hive in some cases).

2011/11/29 Jie Li <[email protected]>

> Hello everyone,
>
> As people are usually most concerned about performance, we need more
> benchmarks to identify the bottlenecks in Pig's performance. For a class
> project we developed a full set of Pig scripts for TPC-H. Though Pig was not
> designed for this RDBMS benchmark, it does support most of the relational
> operators, such as join and aggregation, which can be optimized based on this
> benchmark. Besides that, we can also demonstrate how to write efficient Pig
> scripts by making full use of Pig Latin's features.
>
> Here is what we did:
> 1) wrote correct Pig scripts for TPC-H, verifying them on 1GB of data.
> 2) demonstrated the flexibility of Pig Latin by using the COGROUP operator to
> implement join.
> 3) showed how to optimize a join by reordering it slightly or using a
> replicated join. We think Pig should apply more heuristic optimizations to
> joins, such as evaluating the smaller join first, using a replicated join
> for small tables, and putting the larger table on the right side of the
> hash join.
> 4) identified the poor performance of aggregation. Pig doesn't yet support
> hash-based aggregation, so aggregation is extremely slow. Good to know
> that Pig is just about to support it :)
>
> As TPC-H is well known, a good benchmark result can help change people's
> impression that Pig is slow. In fact, we compared Pig and Hive and found that
> Pig is not necessarily slower than Hive. I wonder if we can create a JIRA
> for this project.
>
> Thanks,
> Jie Li
> PhD Candidate of Computer Science
> Duke University
>
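
To make points 2 and 3 of the quoted message concrete, here is a minimal
Python sketch of the two join strategies they mention. It is not Pig code and
is not taken from the TPC-H scripts described above; the function names and
the toy order/customer rows are hypothetical and only illustrate the idea. A
join built on COGROUP groups both inputs by key and flattens the two bags of
each group, while a replicated join holds the small input in an in-memory
hash table and streams the large input past it.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group both inputs by key: key -> (bag of left rows, bag of right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[left_key(row)][0].append(row)
    for row in right:
        groups[right_key(row)][1].append(row)
    return groups


def join_via_cogroup(left, right, left_key, right_key):
    """Inner join: cross product of the two bags in each group.

    A group with an empty bag on either side contributes no rows.
    """
    out = []
    for left_bag, right_bag in cogroup(left, right, left_key, right_key).values():
        out.extend(l + r for l in left_bag for r in right_bag)
    return out


def replicated_join(large, small, large_key, small_key):
    """Inner join that keeps only the small input in memory."""
    table = defaultdict(list)
    for row in small:
        table[small_key(row)].append(row)
    # Stream the large input and probe the in-memory table.
    for row in large:
        for match in table.get(large_key(row), []):
            yield row + match


# Hypothetical toy data: (order_id, customer_id) and (customer_id, name).
orders = [(10, 1), (11, 1), (12, 3)]
customers = [(1, "alice"), (2, "bob")]

print(join_via_cogroup(orders, customers, lambda o: o[1], lambda c: c[0]))
print(list(replicated_join(orders, customers, lambda o: o[1], lambda c: c[0])))
```

Both calls produce the same joined rows. The difference is in what has to be
held or shuffled: the COGROUP version materializes every group from both
inputs, while the replicated version only needs the small input to fit in
memory.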
