Thanks for the additional comment, Joe. Frank
On Wed, Jul 20, 2016 at 2:17 PM, Joe Hellerstein <[email protected]> wrote: > I heartily concur with this assessment FWIW. I generally use views or > WITH clauses myself when building up these kinds of pipelines. > > In my experience, the PostgreSQL optimizer is not great at collapsing > views though. So caveat emptor. Users should be made very aware of > EXPLAIN! > > J > > > > On July 20, 2016 at 12:46:18 PM, Frank McQuillan ([email protected]) > wrote: > > Hey Jim, > > Thank you for the thoughtful response. > > Given your comments, I think we ought to stick with a view as the default > for sessionize. Looking ahead to MADlib 2.0, one thing we want to better > support is workflows since chaining operations together is such a common > data science thing to do. That means looking across all existing MADlib > functions to determine what changes to returns sets we need to make, such > as standardizing on views. > > Frank > > On Wed, Jul 20, 2016 at 6:10 AM, Jim Nasby <[email protected]> > wrote: > > > On 7/19/16 7:36 PM, Frank McQuillan wrote: > > > >> "create_view (optional) > >> BOOLEAN default: TRUE. Determines whether to create a view or > materialize > >> a > >> table as output. If you only needed session info once, creating a view > >> could be significantly faster than materializing as a table." > >> > >> Question is: should it really default TRUE (view) or is it better to > >> default FALSE (table)? i.e., Or does it really not matter? > >> > > > > tl;dr: it depends, but more importantly I think MADlib should promote > > views as at least an option (if not the default) across the board. > > > > It's going to depend heavily on what you're doing. > > > > If you're building a "pipeline" of operations where sessionization is > just > > the first step of several, and the session data is only referred to > once, > > creating a view gives the planner a lot more flexibility on how to > produce > > output. > > > > When you use a temp table, the planner has no choice: it must not only > > materialize the complete result set immediately, but it also has to > modify > > the catalog to record the temp table. > > > > A pythonic analogy would be that a view is like using a generator (where > > data only needs to be brought forth as it's consumed) while a temp table > is > > like using a list. Except there's an even larger difference in SQL: a > view > > means the optimizer has a chance to change the execution plan of the > final > > query that's using the sessionize output, taking into account everything > > you're doing with the data. > > > > If there are multiple steps in a pipeline, this difference can become > very > > large. I've worked on query chains where switching temp tables to views > has > > had a 5-10x impact. > > > > The two cases where a temp table would be better are if you need to > refer > > to the sessionized data many times (sometimes you'd need to hit it more > > than twice for the temp table to be a win), or if the optimizer ends up > > picking a bad plan when everything is combined into a single query. In > the > > case of a bad plan, it would still be better to either try and tweak the > > optimizer settings, or to insert an "optimization fence", typically done > by > > sticking an OFFSET 0 clause in. > > > > To me, the bigger picture is promoting the option of views across all > > MADlib set returning operations, because when multiple operations are > > chained together you can see a very large benefit. I suspect that it's > more > > common to chain things together, making views a better default... but > > that's just a guess. So if there was a standard default for this across > the > > board, perhaps views would be best. > > -- > > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > > Experts in Analytics, Data Architecture and PostgreSQL > > Data in Trouble? Get it in Treble! http://BlueTreble.com > > 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 > > > >
