[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899605#action_12899605
]
Yan Zhou commented on PIG-1518:
-------------------------------
One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM
boxes is as follows:
Query:
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';
It creates 3 map/reduce jobs.
No Split Combination:
||Mappers|Reducers|
|Job 1 number|120|300|
|Job 1 elapsed time|24s|2m43s|
|Job 2 number|301|300|
|Job 2 elapsed time|46s|3m11s|
|Job 3 number|300|40|
|Job 3 elapsed time|38s|53s|
|Total elapsed time|7m36s|
With Split Combination:
||Mappers|Reducers|
|Job 1 number|120|300|
|Job 1 elapsed time|22s|2m49s|
|Job 2 number|3|300|
|Job 2 elapsed time|27s|2m46s|
|Job 3 number|1|40|
|Job 3 elapsed time|17s|24s|
|Total elapsed time|7m5s|
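To make the comparison easier to read, the figures from the two tables can be tallied with a short script. This is only a sketch of the arithmetic: the variable and function names are made up for illustration, the rows are assumed to be listed in job order, and the totals are the reported ones rather than sums recomputed from the per-phase times.

```python
# Figures copied from the two tables above. Each job is
# (mappers, map_seconds, reducers, reduce_seconds); rows are assumed
# to appear in job order. All names here are illustrative only.
def secs(minutes, seconds):
    return 60 * minutes + seconds

no_combination = [
    (120, secs(0, 24), 300, secs(2, 43)),
    (301, secs(0, 46), 300, secs(3, 11)),
    (300, secs(0, 38), 40, secs(0, 53)),
]
with_combination = [
    (120, secs(0, 22), 300, secs(2, 49)),
    (3, secs(0, 27), 300, secs(2, 46)),
    (1, secs(0, 17), 40, secs(0, 24)),
]

# Totals as reported in the tables, not recomputed from the phases.
reported_total = {
    "no_combination": secs(7, 36),
    "with_combination": secs(7, 5),
}

def total_mappers(jobs):
    return sum(job[0] for job in jobs)

def total_map_seconds(jobs):
    return sum(job[1] for job in jobs)

saved = reported_total["no_combination"] - reported_total["with_combination"]
summary = {
    "mappers_before": total_mappers(no_combination),
    "mappers_after": total_mappers(with_combination),
    "map_seconds_before": total_map_seconds(no_combination),
    "map_seconds_after": total_map_seconds(with_combination),
    "total_seconds_saved": saved,
    "total_percent_saved": round(100.0 * saved / reported_total["no_combination"], 1),
}
print(summary)
```

The mapper count drops sharply in the second and third jobs, while the reducer-side times, which dominate the total, are largely unchanged.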
> multi file input format for loaders
> -----------------------------------
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We frequently run into situations where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> can be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultiFileInputFormat as well as CombineFileInputFormat; however, neither works
> with the new Hadoop 0.20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
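
The idea described above, packing several small files into one split so that a single map task handles them, can be illustrated with a greedy grouping routine. This is a minimal sketch under stated assumptions, not Pig's or Hadoop's implementation: the function name, the file names, and the size cap are hypothetical, and it ignores data locality, block boundaries, and format-specific readers.

```python
def combine_splits(file_sizes, max_combined_size):
    """Greedily pack (path, size) pairs into combined splits.

    A new split is started whenever adding the next file would push
    the current split past max_combined_size. A file larger than the
    cap therefore ends up in a split of its own.
    """
    combined = []
    current, current_size = [], 0
    for path, size in file_sizes:
        if current and current_size + size > max_combined_size:
            combined.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        combined.append(current)
    return combined


# Hypothetical input: sizes in MB, with a 128 MB cap per combined split.
files = [("part-0", 40), ("part-1", 30), ("part-2", 50),
         ("part-3", 200), ("part-4", 10)]
splits = combine_splits(files, 128)
print(len(files), "files ->", len(splits), "splits")
```

Without combination each of the five files would get its own map task; with this grouping the three small leading files share one.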
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.