[jira] Updated: (PIG-1198) [zebra] performance improvements

Chao Wang (JIRA) Thu, 25 Feb 2010 17:00:54 -0800

     [ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chao Wang updated PIG-1198:
---------------------------


Patch reviewed.

Some feedback:

1) In the fillRowSplit() method, reader.close() should always be called at the end.

2) In mapreduce.TableInputFormat.getRowSplits(), the batchSize variable is not 
needed.
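
To illustrate point 1, here is a minimal sketch of closing a reader on every exit path. It assumes "always" includes the case where reading throws. RowReader, ReaderFactory and the body of fillRowSplit() are hypothetical stand-ins, not the actual Zebra classes or the code in the patch.

```java
import java.io.Closeable;
import java.io.IOException;

final class FillRowSplitSketch {
    /** Hypothetical stand-in for the Zebra reader; only close() matters here. */
    interface RowReader extends Closeable {
        long rowCount() throws IOException;
    }

    /** Hypothetical factory, so the sketch does not depend on real Zebra classes. */
    interface ReaderFactory {
        RowReader open(String path) throws IOException;
    }

    /**
     * Reads whatever the split needs from the reader and closes it on every
     * exit path, whether rowCount() returns normally or throws.
     */
    static long fillRowSplit(ReaderFactory factory, String path) throws IOException {
        // try-with-resources calls reader.close() when the block exits.
        try (RowReader reader = factory.open(path)) {
            return reader.rowCount();
        }
    }
}
```

On a pre-Java 7 code base the same guarantee comes from an explicit try { ... } finally { reader.close(); } block.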


Patch looks good overall. +1


> [zebra] performance improvements
> --------------------------------
>
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>
>         Attachments: PIG-1198.patch, PIG-1198.patch
>
>
> Current input split generation is a row-based split on individual TFiles.
> This has the undesirable effect that one split is still generated for each
> TFile, even for TFiles smaller than one block. Consequently, many mappers,
> and many waves, are needed to handle the many small TFiles generated by as
> many mappers/reducers that wrote the data. This issue can be addressed by
> generating input splits that can include multiple TFiles.
>
> For sorted tables, the key distribution generated per table, which is used
> to generate proper input splits, includes key distributions from column
> groups even if they are not in the projection. This incurs the extra cost of
> unnecessary computation and, worse, produces unreasonable input splits.
>
> For unsorted tables, when row splits are generated on a union of tables,
> FileSplits are generated for each table and then lumped together to form the
> final list of splits passed to Map/Reduce. This has the undesirable effect
> that the number of splits depends on the number of tables in the union and
> is not controlled solely by the number of splits used by the Map/Reduce
> framework.
>
> The input split's goal size is calculated over all column groups, even if
> some of them are not in the projection.
>
> For input splits of multiple files in one column group, all files are opened
> at startup. This is unnecessary and holds resources from start to end. The
> files should be opened when needed and closed when no longer needed.
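
The first point in the description above, generating input splits that can include multiple TFiles, can be sketched roughly as a greedy packing of files into splits up to a goal size. This is only an illustration of the idea, not the algorithm in the attached patch: packFiles() and its parameters are hypothetical names, and the sketch keeps each file whole, whereas the real row-based splitter can also divide a large TFile.

```java
import java.util.ArrayList;
import java.util.List;

final class MultiFileSplitSketch {
    /**
     * Groups files (identified by index into fileLengths) into splits so that
     * several small files share one split. A file larger than goalSize ends
     * up alone in its own split in this sketch.
     */
    static List<List<Integer>> packFiles(long[] fileLengths, long goalSize) {
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < fileLengths.length; i++) {
            // Close the current split if adding this file would exceed the goal.
            if (!current.isEmpty() && currentBytes + fileLengths[i] > goalSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += fileLengths[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With one split per file, five small files would need five mappers; packed this way, the number of splits follows the total data size and the goal size rather than the file count.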

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
