[
https://issues.apache.org/jira/browse/PIG-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236573#comment-15236573
]
liyunzhang_intel commented on PIG-4797:
---------------------------------------
[~mohitsabharwal],[~pallavi.rao],[~kexianda]:
PIG-4797 collapses POLocalRearrange, POGlobalRearrange and POPackage into a
single POJoinSpark operator. This removes unnecessary map operations and
speeds up join/group. The spark plans below show the plan before and after
collapsing LR, GLR and PKG into POJoinSpark:
join.pig
{code}
daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
jnd = join daily by (exchange, symbol), divs by (exchange, symbol);
store jnd into './join.out';
{code}
Before join optimization:
{code}
jnd: Store(hdfs://zly1.sh.intel.com:8020/user/root/join.out:org.apache.pig.builtin.PigStorage) - scope-58
|
|---jnd: New For Each(true,true)[tuple] - scope-57
| |
| Project[bag][1] - scope-55
| |
| Project[bag][2] - scope-56
|
|---jnd: Package(Packager)[tuple]{tuple} - scope-48
|
|---jnd: Global Rearrange[tuple] - scope-47
|
|---jnd: Local Rearrange[tuple]{tuple}(false) - scope-49
| | |
| | Project[chararray][0] - scope-50
| | |
| | Project[chararray][1] - scope-51
| |
| |---daily: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-28
| | |
| | Cast[chararray] - scope-2
| | |
| | |---Project[bytearray][0] - scope-1
| | |
| | Cast[chararray] - scope-5
| | |
| | |---Project[bytearray][1] - scope-4
| | |
| | Cast[chararray] - scope-8
| | |
| | |---Project[bytearray][2] - scope-7
| | |
| | Cast[float] - scope-11
| | |
| | |---Project[bytearray][3] - scope-10
| | |
| | Cast[float] - scope-14
| | |
| | |---Project[bytearray][4] - scope-13
| | |
| | Cast[float] - scope-17
| | |
| | |---Project[bytearray][5] - scope-16
| | |
| | Cast[float] - scope-20
| | |
| | |---Project[bytearray][6] - scope-19
| | |
| | Cast[int] - scope-23
| | |
| | |---Project[bytearray][7] - scope-22
| | |
| | Cast[float] - scope-26
| | |
| | |---Project[bytearray][8] - scope-25
| |
| |---daily: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_daily:org.apache.pig.builtin.PigStorage) - scope-0
|
|---jnd: Local Rearrange[tuple]{tuple}(false) - scope-52
| |
| Project[chararray][0] - scope-53
| |
| Project[chararray][1] - scope-54
|
|---divs: New For Each(false,false,false,false)[bag] - scope-42
| |
| Cast[chararray] - scope-31
| |
| |---Project[bytearray][0] - scope-30
| |
| Cast[chararray] - scope-34
| |
| |---Project[bytearray][1] - scope-33
| |
| Cast[chararray] - scope-37
| |
| |---Project[bytearray][2] - scope-36
| |
| Cast[float] - scope-40
| |
| |---Project[bytearray][3] - scope-39
|
|---divs: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_dividends:org.apache.pig.builtin.PigStorage) - scope-29
{code}
After join optimization:
{code}
jnd: Store(hdfs://zly1.sh.intel.com:8020/user/root/join.out:org.apache.pig.builtin.PigStorage) - scope-58
|
|---jnd: New For Each(true,true)[tuple] - scope-57
| |
| Project[bag][1] - scope-55
| |
| Project[bag][2] - scope-56
|
|---POJoinSpark[tuple] - scope-47
|
|---daily: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-28
| | |
| | Cast[chararray] - scope-2
| | |
| | |---Project[bytearray][0] - scope-1
| | |
| | Cast[chararray] - scope-5
| | |
| | |---Project[bytearray][1] - scope-4
| | |
| | Cast[chararray] - scope-8
| | |
| | |---Project[bytearray][2] - scope-7
| | |
| | Cast[float] - scope-11
| | |
| | |---Project[bytearray][3] - scope-10
| | |
| | Cast[float] - scope-14
| | |
| | |---Project[bytearray][4] - scope-13
| | |
| | Cast[float] - scope-17
| | |
| | |---Project[bytearray][5] - scope-16
| | |
| | Cast[float] - scope-20
| | |
| | |---Project[bytearray][6] - scope-19
| | |
| | Cast[int] - scope-23
| | |
| | |---Project[bytearray][7] - scope-22
| | |
| | Cast[float] - scope-26
| | |
| | |---Project[bytearray][8] - scope-25
| |
| |---daily: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_daily:org.apache.pig.builtin.PigStorage) - scope-0
|
|---divs: New For Each(false,false,false,false)[bag] - scope-42
| |
| Cast[chararray] - scope-31
| |
| |---Project[bytearray][0] - scope-30
| |
| Cast[chararray] - scope-34
| |
| |---Project[bytearray][1] - scope-33
| |
| Cast[chararray] - scope-37
| |
| |---Project[bytearray][2] - scope-36
| |
| Cast[float] - scope-40
| |
| |---Project[bytearray][3] - scope-39
|
|---divs: Load(hdfs://zly1.sh.intel.com:8020/user/root/NYSE_dividends:org.apache.pig.builtin.PigStorage) - scope-29
{code}
The changes in PIG-4797.patch are:
1. JoinOptimizerSpark rewrites the spark plan: only when the LR+GLR+PKG
sequence is encountered in the spark plan is it changed to POJoinSpark.
2. In JoinSparkConverter, the RDD executes LocalRearrangeFunction, which
converts each Tuple to a Tuple2<IndexedKey,Tuple>, then CoGroup, then
GroupPkgFunction, which combines the actions of group and package.
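To make the data flow in item 2 easier to follow, here is a rough sketch in plain Python. It is not the patch code and does not use the Spark RDD API or any Pig class: the functions local_rearrange, cogroup, group_pkg and foreach_flatten are hypothetical stand-ins for LocalRearrangeFunction, CoGroup, GroupPkgFunction and the New For Each(true,true) that follows POJoinSpark in the plan, and the sample rows are made up.

```python
from collections import defaultdict
from itertools import product


def local_rearrange(rows, key_cols):
    """Stand-in for LocalRearrangeFunction: row -> (key, row)."""
    return [(tuple(row[c] for c in key_cols), row) for row in rows]


def cogroup(left, right):
    """Collect the rows of both keyed inputs under their shared key."""
    groups = defaultdict(lambda: ([], []))
    for key, row in left:
        groups[key][0].append(row)
    for key, row in right:
        groups[key][1].append(row)
    return dict(groups)


def group_pkg(groups):
    """Stand-in for GroupPkgFunction: emit (key, left_bag, right_bag)."""
    return [(key, bags[0], bags[1]) for key, bags in sorted(groups.items())]


def foreach_flatten(packaged):
    """Stand-in for New For Each(true,true): flatten bag 1 and bag 2.

    Flattening two bags yields their cross product, so a key with an
    empty bag on either side produces no output (inner join).
    """
    out = []
    for _key, left_bag, right_bag in packaged:
        for left_row, right_row in product(left_bag, right_bag):
            out.append(left_row + right_row)
    return out


# Made-up sample rows: (exchange, symbol, date, value)
daily = [("NYSE", "CA", "2009-01-02", 10.0), ("NYSE", "CB", "2009-01-02", 20.0)]
divs = [("NYSE", "CA", "2009-02-01", 0.5)]

keyed_daily = local_rearrange(daily, (0, 1))   # key on (exchange, symbol)
keyed_divs = local_rearrange(divs, (0, 1))
packaged = group_pkg(cogroup(keyed_daily, keyed_divs))
joined = foreach_flatten(packaged)
print(joined)
```

In this sketch the key ("NYSE", "CB") has no matching row in divs, so it is packaged with an empty right bag and drops out when the bags are flattened.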
Performance comparison:
Before the patch: join.pig takes 60 secs.
After the patch: join.pig takes 46 secs.
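For reference, the relative improvement implied by these two timings can be worked out as below. This is just arithmetic on the 60 sec and 46 sec figures reported here; the variable names are illustrative.

```python
before_secs = 60  # join.pig before the patch
after_secs = 46   # join.pig after the patch

saved = before_secs - after_secs
reduction_pct = 100.0 * saved / before_secs
speedup = before_secs / after_secs

print(f"saved {saved} s, {reduction_pct:.1f}% less time, {speedup:.2f}x speedup")
# prints: saved 14 s, 23.3% less time, 1.30x speedup
```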
> Analyze JOIN performance and improve the same.
> ----------------------------------------------
>
> Key: PIG-4797
> URL: https://issues.apache.org/jira/browse/PIG-4797
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Pallavi Rao
> Assignee: liyunzhang_intel
> Labels: spork
> Attachments: Join performance analysis.pdf, PIG-4797.patch
>
>
> There is a big performance difference in join between spark and mr mode.
> {code}
> daily = load './NYSE_daily' as (exchange:chararray, symbol:chararray,
> date:chararray, open:float, high:float, low:float,
> close:float, volume:int, adj_close:float);
> divs = load './NYSE_dividends' as (exchange:chararray, symbol:chararray,
> date:chararray, dividends:float);
> jnd = join daily by (exchange, symbol), divs by (exchange, symbol);
> store jnd into './join.out';
> {code}
> join.sh
> {code}
> mode=$1
> start=$(date +%s)
> ./pig -x $mode $PIG_HOME/bin/join.pig
> end=$(date +%s)
> execution_time=$(( $end - $start ))
> echo "execution_time:"$execution_time
> {code}
> The execution time:
> || ||mr||spark||
> |join|20 sec|79 sec|
> You can download the test data NYSE_daily and NYSE_dividends from
> https://github.com/alanfgates/programmingpig/blob/master/data/