Olga Natkovich resolved PIG-614.

    Resolution: Duplicate

This issue will be addressed by https://issues.apache.org/jira/browse/PIG-627

> reduce io during sharing scans of the same input datasets 
> ----------------------------------------------------------
>                 Key: PIG-614
>                 URL: https://issues.apache.org/jira/browse/PIG-614
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Samuel Guo
>            Priority: Minor
>             Fix For: types_branch
> If we want to store different results generated from the same input
> dataset, we currently need to write two or more *STORE* clauses. These
> *STORE* clauses are translated into separate MR jobs, even though the
> jobs could share a single scan of the same input dataset.
> For example:
> The dataset 'weather' contains weather records. Each record has
> three parts: wind/air/temperature. We need to process each part of the
> records separately.
> We might write a Pig script like this:
> weather = load 'weather.txt' as (wind, air, temperature);
> wind_results = ... wind ...;
> air_results = ... air ...;
> temp_results = ... temperature ...;
> store wind_results into 'wind.results';
> store air_results into 'air.results';
> store temp_results into 'temp.results';
> Pig currently translates this script into three separate MR jobs, which run
> sequentially: scan 'weather.txt', process the wind data, store the wind
> results; scan 'weather.txt' again, process the air data, store the air
> results; ...
> If the input dataset is large, this is inefficient.
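
The I/O difference described above can be illustrated with a small sketch. This is plain Python, not Pig and not how Pig plans jobs. The function names, the sample records, and the field-extracting "processors" are all hypothetical stand-ins for the elided per-field processing in the script. It only counts how many times the input is scanned under each strategy.

```python
# Illustrative model only: counts input scans, does not model MapReduce.

def run_separate_jobs(records, processors):
    """One full scan of the input per stored result (the behavior reported)."""
    scans = 0
    results = {}
    for name, process in processors.items():
        scans += 1  # each job re-reads the whole input
        results[name] = [process(r) for r in records]
    return results, scans


def run_shared_scan(records, processors):
    """A single scan of the input that feeds every result (the requested behavior)."""
    scans = 1
    results = {name: [] for name in processors}
    for r in records:  # the input is read once
        for name, process in processors.items():
            results[name].append(process(r))
    return results, scans


# Hypothetical sample data: (wind, air, temperature) tuples.
weather = [(3, 1012, 21), (5, 1008, 19)]

# Hypothetical stand-ins for the elided processing of each field.
processors = {
    "wind": lambda r: r[0],
    "air": lambda r: r[1],
    "temp": lambda r: r[2],
}

separate_results, separate_scans = run_separate_jobs(weather, processors)
shared_results, shared_scans = run_shared_scan(weather, processors)
```

Both strategies produce the same results. Only the number of passes over the input differs, and that matters more as the input grows.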

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.