reduce io during sharing scans of the same input datasets
----------------------------------------------------------

                 Key: PIG-614
                 URL: https://issues.apache.org/jira/browse/PIG-614
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: types_branch
            Reporter: Samuel Guo
            Priority: Minor
             Fix For: types_branch


To store several results derived from the same input dataset, we currently have to write two or more *STORE* clauses. These *STORE* clauses are translated into separate MR jobs, even though those jobs could share a single scan of the same input dataset.

For example, the dataset 'weather' contains weather records. Each record has three parts: wind, air, and temperature. We need to process each part of the records separately, so we might write a Pig script like this:

    weather = load 'weather.txt' as (wind, air, temperature);
    wind_results = ... wind ...;
    air_results = ...air...;
    temp_results = ...temperature...;
    store wind_results into 'wind.results';
    store air_results into 'air.results';
    store temp_results into 'temp.results';

Pig currently translates this script into three different MR jobs, which run sequentially:

    scan 'weather.txt', process the wind data, store the wind results;
    scan 'weather.txt' again, process the air data, store the air results;
    ...

If the input dataset is large, this is inefficient.
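
To make the I/O difference concrete, here is a minimal Python sketch of the two strategies. It is only an illustration, not how Pig is implemented: the CountingSource class, the PIPELINES table, the per-field lambdas, and the record layout are all hypothetical stand-ins, since the issue leaves the actual processing as "...".

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical record layout: (wind, air, temperature).
Record = Tuple[float, float, float]


class CountingSource:
    """Stand-in for 'weather.txt' that counts how many full scans are started."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records
        self.scans = 0

    def __iter__(self) -> Iterator[Record]:
        self.scans += 1
        return iter(self._records)


# Placeholder processing steps; the real logic is not given in the issue.
PIPELINES: Dict[str, Callable[[Record], float]] = {
    "wind.results": lambda rec: rec[0],
    "air.results": lambda rec: rec[1],
    "temp.results": lambda rec: rec[2],
}


def run_separate_jobs(source: CountingSource) -> Dict[str, List[float]]:
    """One full scan of the input per STORE, as described in the issue."""
    return {name: [fn(rec) for rec in source] for name, fn in PIPELINES.items()}


def run_shared_scan(source: CountingSource) -> Dict[str, List[float]]:
    """A single scan; each record is handed to every pipeline."""
    outputs: Dict[str, List[float]] = {name: [] for name in PIPELINES}
    for rec in source:
        for name, fn in PIPELINES.items():
            outputs[name].append(fn(rec))
    return outputs


if __name__ == "__main__":
    records = [(3.0, 1.0, 21.5), (5.5, 0.9, 19.0)]
    separate, shared = CountingSource(records), CountingSource(records)
    assert run_separate_jobs(separate) == run_shared_scan(shared)
    print(separate.scans, shared.scans)
```

Both functions produce the same outputs, but the first reads the input once per stored result while the second reads it once in total, which is the saving this issue asks for.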