There are a few open JIRAs that are related to refactoring the query plan code to allow for stats-based runtime optimizations:
https://issues.apache.org/jira/browse/PIG-483
https://issues.apache.org/jira/browse/PIG-2784

If anyone has thoughts/opinions around suggested design changes, those JIRAs could be a good place to chime in.

On Mon, Aug 6, 2012 at 5:18 PM, Dmitriy Ryaboy <[email protected]> wrote:

> +1 to that.
>
> We can get stats from the Hive metadata catalog via HCat. Loaders can
> already implement the LoadStatistics interface -- and if HCatLoader
> does this, we can create them via Hive and use that team's great work.
> We should also allow stats to be passed (and modified appropriately)
> through the dag, and instrument intermediate data writers to collect
> stats and send telemetry back for improved flow planning, but that's a
> separate conversation.
>
> D
>
> On Mon, Aug 6, 2012 at 10:35 AM, Alan Gates <[email protected]> wrote:
> > Pig does not have a metadata store, so it doesn't store statistics on
> > data. However, through HCatalog it will have access to the same
> > statistics that Hive stores.
> >
> > As far as using this data to optimize Pig operations, I'd like to rework
> > the backend to start taking advantage of such statistics when available
> > (either from metadata like this or statistics that are generated on the
> > fly as scripts are executed). I also hope to share as much of this work
> > as possible with Hive so that both can benefit.
> >
> > Alan.
> >
> > On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:
> >
> >> Hello everyone
> >>
> >> Came across this excellent post about storing column statistics in Hive
> >> http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
> >>
> >> Does pig gather statistics similar to what hive does? I think gathering
> >> such statistics will be very helpful not only for cost based optimizer
> >> but in other cases like knowing the count of rows, knowing the histogram
> >> of underlying data etc.. In my case, I am working on cube computation
> >> for holistic measure where I need to know the count of rows, based on it
> >> I can load sample data set for determining the partition factor for
> >> large groups. I am sure gathering statistics and persisting it will help
> >> in other cases/optimizations as well.
> >>
> >> If I am right, pig doesn't use cost based estimation while optimizing
> >> the logical plan instead I believe it uses rules of thumb (Plz. correct
> >> me if I am wrong). Having statistics about the datasets would help to
> >> provide better optimization (similar to the join optimization in the
> >> blog post). Any thoughts about having such statistics in pig and
> >> implementing ANALYZE command for gathering statistics?
> >>
> >> Thanks
> >> -- Prasanth Jayachandran
> >>

--
*Note that I'm no longer using my Yahoo! email address. Please email me at [email protected] going forward.*
