How would flat files work? The data needs to be updated by every pig run.

On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi <[email protected]> wrote:
> Awesome! It would be good to have a flat-file based impl as there will
> probably be a lot of pig users not having an hbase instance set up for
> stats persistence. Let me know if I can help in any way.
>
> Is there a timeframe you are looking at for open-sourcing this?
>
>
> On Dec 4, 2012, at 12:32 PM, Bill Graham <[email protected]> wrote:
>
>> We do basically what you're describing. Each of our scripts has a logical
>> name which defines the workflow. For each job in the workflow we persist
>> the job stats, counters and conf in HBase via an implementation of
>> PigProgressNotificationListener. We can then correlate jobs in a run of the
>> workflow together based on the pig.script.start.time and pig.job.start.time
>> properties. We use the logical plan script signature to determine whether
>> the script version has changed.
>>
>> During job execution we query the service in an impl of PigReducerEstimator
>> for matching workflows.
>>
>> One simple estimation algo is to multiply Pig's default estimated reducers
>> (which are based on mapper input bytes) by the ratio of mapper output bytes
>> over mapper input bytes of previous runs. The same could also be done with
>> slot time, but we haven't tried that yet.
>>
>> We plan to open source parts of this at some point.
>>
>>
>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi
>> <[email protected]> wrote:
>>
>>> I have been thinking about using Pig statistics for # reducers estimation.
>>> Though the current heuristic approach works fine, it requires an admin or
>>> the programmer to understand what the best number would be for the job.
>>> We are seeing a large number of jobs over-utilizing resources, and there
>>> is obviously no default number that works well for all kinds of pig
>>> scripts. A few non-technical users find it difficult to estimate the best
>>> # for their jobs. It would be great if we could use stats from previous
>>> runs of a job to set the number of reducers for future runs.
>>>
>>> This would be a nice feature for jobs running in production, where the job
>>> or the dataset size does not fluctuate a great deal.
>>>
>>>
>>> 1. Set a config param in the script
>>>    - set script.unique.id prashant.1111222111.demo_script
>>> 2. If the above is not set, we fall back on the current implementation
>>> 3. If the above is set
>>>    - At the end of the job, persist PigStats (namely Reduce Shuffle
>>>      Bytes) to FS (hdfs, local, s3....). This would be
>>>      "${script.unique.id}_YYYYMMDDHHmmss" - let's call this stats_on_hdfs
>>>    - Read "stats_on_hdfs" for previous runs, and based on the number of
>>>      such stats to read (based on script.reducer.estimation.num.stats)
>>>      calculate an average number of reducers for the current run.
>>>    - If no stats_on_hdfs exists, we fall back on the current implementation
>>>
>>> It would be advisable not to keep stats retention too long, and Pig can
>>> make sure to clear up old files that are not required.
>>>
>>> What do you guys think?
>>>
>>> -Prashant
>>>
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> [email protected] going forward.*
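
[Editor's note] The ratio-based estimate Bill describes above reduces to a single
multiplication. The following is a minimal Java sketch of that arithmetic only;
the class and method names are hypothetical and this is not the implementation
referenced in the thread.

    // Scale Pig's default reducer estimate (derived from mapper input bytes)
    // by the map output/input byte ratio observed in previous runs.
    // Hypothetical names; shown only to illustrate the calculation.
    public class RatioBasedReducerEstimate {

        /**
         * @param defaultEstimatedReducers Pig's default estimate for this job
         * @param prevMapInputBytes        mapper input bytes from previous runs
         * @param prevMapOutputBytes       mapper output bytes from previous runs
         * @return adjusted reducer count, never less than 1
         */
        public static int estimate(int defaultEstimatedReducers,
                                   long prevMapInputBytes,
                                   long prevMapOutputBytes) {
            if (prevMapInputBytes <= 0) {
                // No usable history: keep the default estimate.
                return Math.max(1, defaultEstimatedReducers);
            }
            double outputToInputRatio = (double) prevMapOutputBytes / prevMapInputBytes;
            return Math.max(1, (int) Math.ceil(defaultEstimatedReducers * outputToInputRatio));
        }
    }

For example, with a default estimate of 10 reducers and a previous run that
shrank 100 GB of map input to 20 GB of map output, this would suggest 2 reducers.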

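[Editor's note] And a rough sketch of the flat-file variant proposed in the
thread, using the Hadoop FileSystem API so the same code could target HDFS, the
local FS, or S3. The layout follows the proposal (one small file per run named
"${script.unique.id}_YYYYMMDDHHmmss" holding a single number); the class and
method names, and the choice to store one value per file, are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Hypothetical helper for the flat-file scheme: one tiny file per run,
     * named "<script.unique.id>_YYYYMMDDHHmmss", holding a single number
     * (e.g. reduce shuffle bytes, or the reducer count actually used).
     */
    public class FlatFileStatsStore {

        private final FileSystem fs;
        private final Path statsDir;

        public FlatFileStatsStore(Configuration conf, Path statsDir) throws IOException {
            this.fs = statsDir.getFileSystem(conf);
            this.statsDir = statsDir;
        }

        /** Persist one value for this run, e.g. at job completion. */
        public void persist(String scriptUniqueId, String timestamp, long value) throws IOException {
            Path file = new Path(statsDir, scriptUniqueId + "_" + timestamp);
            try (Writer out = new OutputStreamWriter(fs.create(file, true), StandardCharsets.UTF_8)) {
                out.write(Long.toString(value));
            }
        }

        /**
         * Average the values of the most recent numStats runs
         * (script.reducer.estimation.num.stats in the proposal).
         * Returns -1 when no history exists, so the caller can fall
         * back to the current estimation.
         */
        public long averageOfRecentRuns(String scriptUniqueId, int numStats) throws IOException {
            FileStatus[] files = fs.globStatus(new Path(statsDir, scriptUniqueId + "_*"));
            if (files == null || files.length == 0) {
                return -1;
            }
            // Timestamps are YYYYMMDDHHmmss, so lexicographic order is chronological.
            Arrays.sort(files, Comparator.comparing((FileStatus f) -> f.getPath().getName()));
            long sum = 0;
            int count = 0;
            for (int i = Math.max(0, files.length - numStats); i < files.length; i++) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(files[i].getPath()), StandardCharsets.UTF_8))) {
                    String line = in.readLine();
                    if (line != null) {
                        sum += Long.parseLong(line.trim());
                        count++;
                    }
                }
            }
            return count == 0 ? -1 : sum / count;
        }
    }

An estimator could then call averageOfRecentRuns(scriptUniqueId, numStats) and
fall back to the current implementation when it returns -1, matching step 3 of
the proposal.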