You could store stats for the last X runs and use their average to set the reducer count for the next run.
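For instance, here is a minimal sketch of that averaging step, assuming the persisted stats expose reduce shuffle bytes and that we target roughly 1 GB of shuffle data per reducer (in the same ballpark as Pig's pig.exec.reducers.bytes.per.reducer default). The class and method names below are made up for illustration; this is not Pig's actual estimator API.

// Sketch only: averages reduce-shuffle-bytes from the last X runs and
// converts that to a reducer count. Names are hypothetical, not Pig's API.
import java.util.List;

public class AveragingReducerEstimator {

    // Assumed target of ~1 GB of shuffle data per reducer.
    private static final long BYTES_PER_REDUCER = 1000000000L;

    /**
     * @param shuffleBytesHistory reduce-shuffle-bytes from the last X runs,
     *                            read back from the persisted stats files
     * @param fallback            reducer count from the current heuristic,
     *                            used when no history exists yet
     */
    public static int estimateReducers(List<Long> shuffleBytesHistory, int fallback) {
        if (shuffleBytesHistory == null || shuffleBytesHistory.isEmpty()) {
            return fallback; // no stats_on_hdfs yet: keep current behavior
        }
        long total = 0;
        for (long bytes : shuffleBytesHistory) {
            total += bytes;
        }
        long avgShuffleBytes = total / shuffleBytesHistory.size();
        // Ceiling division so a small job still gets at least one reducer.
        return (int) Math.max(1, (avgShuffleBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }
}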
Sent from my iPhone

On Dec 6, 2012, at 8:09 PM, Dmitriy Ryaboy <[email protected]> wrote:

> How would flat files work? The data needs to be updated by every pig run.
>
> On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> Awesome! It would be good to have a flat-file based impl, as there will
>> probably be a lot of pig users who don't have an hbase instance set up
>> for stats persistence. Let me know if I can help in any way.
>>
>> Is there a timeframe you are looking at for open-sourcing this?
>>
>>
>> On Dec 4, 2012, at 12:32 PM, Bill Graham <[email protected]> wrote:
>>
>>> We do basically what you're describing. Each of our scripts has a logical
>>> name which defines the workflow. For each job in the workflow we persist
>>> the job stats, counters and conf in HBase via an implementation of
>>> PigProgressNotificationListener. We can then correlate jobs in a run of
>>> the workflow based on the pig.script.start.time and pig.job.start.time
>>> properties. We use the logical plan script signature to determine whether
>>> the script version has changed.
>>>
>>> During job execution we query the service in an impl of
>>> PigReducerEstimator for matching workflows.
>>>
>>> One simple estimation algo is to multiply Pig's default estimated
>>> reducers (which are based on mapper input bytes) by the ratio of mapper
>>> output bytes over mapper input bytes of previous runs. The same could
>>> also be done with slot time, but we haven't tried that yet.
>>>
>>> We plan to open source parts of this at some point.
>>>
>>>
>>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi
>>> <[email protected]> wrote:
>>>
>>>> I have been thinking about using Pig statistics to estimate the number
>>>> of reducers. Though the current heuristic approach works fine, it
>>>> requires an admin or the programmer to understand what the best number
>>>> would be for the job. We are seeing a large number of jobs
>>>> over-utilizing resources, and there is obviously no default number that
>>>> works well for all kinds of pig scripts. A few non-technical users find
>>>> it difficult to estimate the best number for their jobs. It would be
>>>> great if we could use stats from previous runs of a job to set the
>>>> number of reducers for future runs.
>>>>
>>>> This would be a nice feature for jobs running in production, where the
>>>> job or the dataset size does not fluctuate a huge deal.
>>>>
>>>>
>>>> 1. Set a config param in the script
>>>>    - set script.unique.id prashant.1111222111.demo_script
>>>> 2. If the above is not set, we fall back on the current implementation
>>>> 3. If the above is set
>>>>    - At the end of the job, persist PigStats (namely Reduce Shuffle
>>>>      Bytes) to FS (hdfs, local, s3....). This would be
>>>>      "${script.unique.id}_YYYYMMDDHHmmss" - let's call this stats_on_hdfs
>>>>    - Read "stats_on_hdfs" for previous runs, and based on the number of
>>>>      such stats to read (set via script.reducer.estimation.num.stats),
>>>>      calculate an average number of reducers for the current run.
>>>>    - If no stats_on_hdfs exists, we fall back on the current
>>>>      implementation
>>>>
>>>> It would be advisable not to retain stats for too long, and Pig can make
>>>> sure to clear up old files that are no longer required.
>>>>
>>>> What do you guys think?
>>>>
>>>> -Prashant
>>>>
>>>
>>>
>>>
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me
>>> at [email protected] going forward.*
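To make the ratio-based adjustment described in Bill's reply above concrete, here is a rough sketch: scale Pig's default estimate (driven by mapper input bytes) by the ratio of mapper output bytes to mapper input bytes observed in previous runs of the same workflow. The class and method names are illustrative only; this does not implement Pig's actual PigReducerEstimator interface.

// Sketch of the scaling idea from the thread; names are hypothetical.
public class RatioBasedEstimate {

    /**
     * @param defaultEstimate    reducers Pig would pick from mapper input bytes
     * @param prevMapOutputBytes total map output bytes from earlier runs
     * @param prevMapInputBytes  total map input bytes from earlier runs
     */
    public static int scaleByHistoricalRatio(int defaultEstimate,
                                             long prevMapOutputBytes,
                                             long prevMapInputBytes) {
        if (prevMapInputBytes <= 0) {
            return defaultEstimate; // no usable history: keep the default
        }
        double ratio = (double) prevMapOutputBytes / (double) prevMapInputBytes;
        // Jobs that expand their input get more reducers; jobs that
        // aggregate heavily get fewer, but never less than one.
        return Math.max(1, (int) Math.ceil(defaultEstimate * ratio));
    }
}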
