How would flat files work? The data needs to be updated by every pig run.

On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi <[email protected]> wrote:
> Awesome! It would be good to have a flat-file based impl as there will
> probably be a lot of pig users not having an hbase instance set up for
> stats persistence. Let me know if I can help in any way.
>
> Is there a timeframe you are looking at for open-sourcing this?
>
>
> On Dec 4, 2012, at 12:32 PM, Bill Graham <[email protected]> wrote:
>
>> We do basically what you're describing. Each of our scripts has a logical
>> name which defines the workflow. For each job in the workflow we persist
>> the job stats, counters and conf in HBase via an implementation of
>> PigProgressNotificationListener. We can then correlate jobs in a run of the
>> workflow together based on the pig.script.start.time and pig.job.start.time
>> properties. We use the logical plan script signature to determine whether
>> the script version has changed.
>>
>> During job execution we query the service in an impl of PigReducerEstimator
>> for matching workflows.
>>
>> One simple estimation algo is to multiply Pig's default estimated reducers
>> (which are based on mapper input bytes) by the ratio of mapper output bytes
>> over mapper input bytes of previous runs. The same could also be done with
>> slot time, but we haven't tried that yet.
>>
>> We plan to open source parts of this at some point.
>>
>>
>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi
>> <[email protected]> wrote:
>>
>>> I have been thinking about using Pig statistics for # reducers estimation.
>>> Though the current heuristic approach works fine, it requires an admin or
>>> the programmer to understand what the best number would be for the job.
>>> We are seeing a large number of jobs over-utilizing resources, and there
>>> is obviously no default number that works well for all kinds of pig
>>> scripts. A few non-technical users find it difficult to estimate the best
>>> # for their jobs. It would be great if we could use stats from previous
>>> runs of a job to set the number of reducers for future runs.
>>>
>>> This would be a nice feature for jobs running in production, where the job
>>> or the dataset size does not fluctuate a great deal.
>>>
>>>
>>> 1. Set a config param in the script
>>>    - set script.unique.id prashant.1111222111.demo_script
>>> 2. If the above is not set, we fall back on the current implementation
>>> 3. If the above is set
>>>    - At the end of the job, persist PigStats (namely Reduce Shuffle
>>>      Bytes) to FS (hdfs, local, s3....). This would be
>>>      "${script.unique.id}_YYYYMMDDHHmmss" - let's call this stats_on_hdfs
>>>    - Read "stats_on_hdfs" for previous runs, and based on the number of
>>>      such stats to read (based on script.reducer.estimation.num.stats)
>>>      calculate an average number of reducers for the current run.
>>>    - If no stats_on_hdfs exists, we fall back on the current implementation
>>>
>>> It would be advisable not to keep stats retention too long, and Pig can
>>> make sure to clear up old files that are not required.
>>>
>>> What do you guys think?
>>>
>>> -Prashant
>>>
>>
>>
>>
>> --
>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>> [email protected] going forward.*
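
[Editor's note] The ratio-based estimate Bill describes above reduces to a single
multiplication. The following is a minimal Java sketch of that arithmetic only;
the class and method names are hypothetical and this is not the implementation
referenced in the thread.

    // Scale Pig's default reducer estimate (derived from mapper input bytes)
    // by the map output/input byte ratio observed in previous runs.
    // Hypothetical names; shown only to illustrate the calculation.
    public class RatioBasedReducerEstimate {

        /**
         * @param defaultEstimatedReducers Pig's default estimate for this job
         * @param prevMapInputBytes        mapper input bytes from previous runs
         * @param prevMapOutputBytes       mapper output bytes from previous runs
         * @return adjusted reducer count, never less than 1
         */
        public static int estimate(int defaultEstimatedReducers,
                                   long prevMapInputBytes,
                                   long prevMapOutputBytes) {
            if (prevMapInputBytes <= 0) {
                // No usable history: keep the default estimate.
                return Math.max(1, defaultEstimatedReducers);
            }
            double outputToInputRatio = (double) prevMapOutputBytes / prevMapInputBytes;
            return Math.max(1, (int) Math.ceil(defaultEstimatedReducers * outputToInputRatio));
        }
    }

For example, with a default estimate of 10 reducers and a previous run that
shrank 100 GB of map input to 20 GB of map output, this would suggest 2 reducers.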

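[Editor's note] And a rough sketch of the flat-file variant proposed in the
thread, using the Hadoop FileSystem API so the same code could target HDFS, the
local FS, or S3. The layout follows the proposal (one small file per run named
"${script.unique.id}_YYYYMMDDHHmmss" holding a single number); the class and
method names, and the choice to store one value per file, are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Hypothetical helper for the flat-file scheme: one tiny file per run,
     * named "<script.unique.id>_YYYYMMDDHHmmss", holding a single number
     * (e.g. reduce shuffle bytes, or the reducer count actually used).
     */
    public class FlatFileStatsStore {

        private final FileSystem fs;
        private final Path statsDir;

        public FlatFileStatsStore(Configuration conf, Path statsDir) throws IOException {
            this.fs = statsDir.getFileSystem(conf);
            this.statsDir = statsDir;
        }

        /** Persist one value for this run, e.g. at job completion. */
        public void persist(String scriptUniqueId, String timestamp, long value) throws IOException {
            Path file = new Path(statsDir, scriptUniqueId + "_" + timestamp);
            try (Writer out = new OutputStreamWriter(fs.create(file, true), StandardCharsets.UTF_8)) {
                out.write(Long.toString(value));
            }
        }

        /**
         * Average the values of the most recent numStats runs
         * (script.reducer.estimation.num.stats in the proposal).
         * Returns -1 when no history exists, so the caller can fall
         * back to the current estimation.
         */
        public long averageOfRecentRuns(String scriptUniqueId, int numStats) throws IOException {
            FileStatus[] files = fs.globStatus(new Path(statsDir, scriptUniqueId + "_*"));
            if (files == null || files.length == 0) {
                return -1;
            }
            // Timestamps are YYYYMMDDHHmmss, so lexicographic order is chronological.
            Arrays.sort(files, Comparator.comparing((FileStatus f) -> f.getPath().getName()));
            long sum = 0;
            int count = 0;
            for (int i = Math.max(0, files.length - numStats); i < files.length; i++) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(files[i].getPath()), StandardCharsets.UTF_8))) {
                    String line = in.readLine();
                    if (line != null) {
                        sum += Long.parseLong(line.trim());
                        count++;
                    }
                }
            }
            return count == 0 ? -1 : sum / count;
        }
    }

An estimator could then call averageOfRecentRuns(scriptUniqueId, numStats) and
fall back to the current implementation when it returns -1, matching step 3 of
the proposal.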