[ 
https://issues.apache.org/jira/browse/PIG-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233116#comment-13233116
 ] 

Thejas M Nair commented on PIG-1270:
------------------------------------

These are the results from tests I ran to check performance on larger
data.

Query - 
{code}
grunt> l = load '/tmp/bigfile2' as (a,b,c);
grunt> lim = limit l 10;
grunt> dump lim;
{code}

Ran on a cluster with 8 map slots.

With 128MB block size, 499 maps -
|| || trunk || trunk+patch ||
| avg run time | 17 min 7 sec | 6 min 44 sec |
| avg run time of map | 12 sec | 4 sec |


With a smaller number of splits, the improvement is larger.
With 'set pig.maxCombinedSplitSize 1073741824' (i.e. a split size of 1G) and 64
maps -
|| || trunk || trunk+patch ||
| avg run time | 15 min 19 sec | 1 min 10 sec |
| avg run time of map | 106 sec | 4 sec |
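
For reference, the speedups implied by the two tables can be worked out with a short calculation. This is only arithmetic on the run times listed above; the dictionary labels and the helper name `to_seconds` are made up for the illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M min S sec' duration into seconds."""
    return minutes * 60 + seconds


# (trunk, trunk+patch) average run times, taken from the tables above.
runs = {
    "128MB blocks, 499 maps": (to_seconds(17, 7), to_seconds(6, 44)),
    "1G combined splits, 64 maps": (to_seconds(15, 19), to_seconds(1, 10)),
}

# Speedup = trunk run time divided by patched run time.
speedups = {name: trunk / patched for name, (trunk, patched) in runs.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster with the patch")
```

With the default block size the patched run is roughly 2.5 times faster; with 1G combined splits it is roughly 13 times faster, which matches the observation that fewer, larger splits benefit more.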

                
> Push limit into loader
> ----------------------
>
>                 Key: PIG-1270
>                 URL: https://issues.apache.org/jira/browse/PIG-1270
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.10
>
>         Attachments: PIG-1270-1.patch, PIG-1270-2.patch, PIG-1270-3.patch
>
>
> We can optimize the limit operation by stopping early in PigRecordReader. In
> general, we need a way to communicate between PigRecordReader and the execution
> pipeline. POLimit could tell PigRecordReader that it already has enough
> records, so that the reader stops feeding more data.
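
The idea in the description can be sketched roughly as follows. This is a toy model in Python, not the code from the attached patches: the classes `RecordReader` and `Limit` and the method `request_stop` are hypothetical stand-ins for PigRecordReader, POLimit, and whatever signalling the patch actually uses.

```python
from typing import Iterable, Iterator, List


class RecordReader:
    """Hypothetical stand-in for PigRecordReader: feeds records until told to stop."""

    def __init__(self, source: Iterable[str]) -> None:
        self._source = iter(source)
        self._stopped = False
        self.records_read = 0  # how many records were actually pulled from the input

    def request_stop(self) -> None:
        # Called by a downstream operator once it needs no more input.
        self._stopped = True

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        if self._stopped:
            raise StopIteration
        record = next(self._source)
        self.records_read += 1
        return record


class Limit:
    """Hypothetical stand-in for POLimit: keeps only the first n records."""

    def __init__(self, n: int, reader: RecordReader) -> None:
        self._n = n
        self._reader = reader

    def run(self) -> List[str]:
        out: List[str] = []
        if self._n <= 0:
            self._reader.request_stop()
            return out
        for record in self._reader:
            out.append(record)
            if len(out) >= self._n:
                # Enough records: tell the reader to stop feeding data.
                self._reader.request_stop()
                break
        return out


# A large lazy input; only the first 10 records should ever be read.
reader = RecordReader(f"row{i}" for i in range(1_000_000))
result = Limit(10, reader).run()
print(len(result), reader.records_read)
```

In this sketch the reader pulls only as many records as the limit needs and ignores the rest of its input, which is the behaviour the description asks for on each map's split.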


        
