[jira] Created: (PIG-614) reduce io during sharing scans of the same input datasets

Samuel Guo (JIRA) Sun, 11 Jan 2009 23:31:23 -0800

reduce io during sharing scans of the same input datasets 
----------------------------------------------------------


                 Key: PIG-614
                 URL: https://issues.apache.org/jira/browse/PIG-614
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: types_branch
            Reporter: Samuel Guo
            Priority: Minor
             Fix For: types_branch


To store several different results generated from the same input 
dataset, we currently need to write two or more *STORE* clauses. These *STORE* 
clauses are translated into separate MR jobs, even though those jobs may 
share scans of the same input dataset.

For example:
The dataset 'weather' contains weather records. Each record has 
three parts: wind, air, and temperature. We want to process each part of the 
records separately.
We might write a Pig script like this:

weather = load 'weather.txt' as (wind, air, tempreture);
wind_results = ... wind ...;
air_results = ...air...;
temp_results = ...tempreture...;
store wind_results into 'wind.results';
store air_results into 'air.results';
store temp_results into 'temp.results';

Pig currently translates this script into three different MR jobs, which run 
sequentially: scan 'weather.txt', process the wind data, store the wind results; 
scan 'weather.txt' again, process the air data, store the air results; ... 

If the input dataset is large, this is inefficient.
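
The I/O difference between one scan per output and a single shared scan can be sketched outside of Pig. The Python below is only an illustration under stated assumptions, not how Pig plans or runs jobs: `CountingSource`, `PROCESSORS`, `separate_scans`, and `shared_scan` are hypothetical names, and the per-field processing (simply picking out one field) is a placeholder for the steps the script elides.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# (wind, air, tempreture), mirroring the schema in the script above.
Record = Tuple[str, str, str]


class CountingSource:
    """In-memory stand-in for 'weather.txt' that counts full scans."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records
        self.scans = 0

    def scan(self) -> Iterator[Record]:
        self.scans += 1
        return iter(self._records)


# Placeholder processing steps (assumption): each output keeps one field.
PROCESSORS: Dict[str, Callable[[Record], str]] = {
    "wind.results": lambda r: r[0],
    "air.results": lambda r: r[1],
    "temp.results": lambda r: r[2],
}


def separate_scans(source: CountingSource) -> Dict[str, List[str]]:
    """One full scan per output, like the three sequential jobs."""
    return {
        name: [fn(record) for record in source.scan()]
        for name, fn in PROCESSORS.items()
    }


def shared_scan(source: CountingSource) -> Dict[str, List[str]]:
    """A single scan whose records feed every output."""
    outputs: Dict[str, List[str]] = {name: [] for name in PROCESSORS}
    for record in source.scan():
        for name, fn in PROCESSORS.items():
            outputs[name].append(fn(record))
    return outputs
```

Both functions produce the same outputs; the only difference is how many times `scan()` is called on the input (once per output versus once in total), which is the cost this issue proposes to reduce.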

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
