Thanks for the additional comment, Joe.

Frank

On Wed, Jul 20, 2016 at 2:17 PM, Joe Hellerstein <[email protected]>
wrote:

> I heartily concur with this assessment FWIW.  I generally use views or
> WITH clauses myself when building up these kinds of pipelines.
>
> In my experience, the PostgreSQL optimizer is not great at collapsing
> views though.  So caveat emptor.  Users should be made very aware of
> EXPLAIN!
>
> J
>
>
>
> On July 20, 2016 at 12:46:18 PM, Frank McQuillan ([email protected])
> wrote:
>
> Hey Jim,
>
> Thank you for the thoughtful response.
>
> Given your comments, I think we ought to stick with a view as the default
> for sessionize. Looking ahead to MADlib 2.0, one thing we want to better
> support is workflows since chaining operations together is such a common
> data science thing to do. That means looking across all existing MADlib
> functions to determine what changes to returns sets we need to make, such
> as standardizing on views.
>
> Frank
>
> On Wed, Jul 20, 2016 at 6:10 AM, Jim Nasby <[email protected]>
> wrote:
>
> > On 7/19/16 7:36 PM, Frank McQuillan wrote:
> >
> >> "create_view (optional)
> >> BOOLEAN default: TRUE. Determines whether to create a view or
> materialize
> >> a
> >> table as output. If you only needed session info once, creating a view
> >> could be significantly faster than materializing as a table."
> >>
> >> Question is: should it really default TRUE (view) or is it better to
> >> default FALSE (table)? i.e., Or does it really not matter?
> >>
> >
> > tl;dr: it depends, but more importantly I think MADlib should promote
> > views as at least an option (if not the default) across the board.
> >
> > It's going to depend heavily on what you're doing.
> >
> > If you're building a "pipeline" of operations where sessionization is
> just
> > the first step of several, and the session data is only referred to
> once,
> > creating a view gives the planner a lot more flexibility on how to
> produce
> > output.
> >
> > When you use a temp table, the planner has no choice: it must not only
> > materialize the complete result set immediately, but it also has to
> modify
> > the catalog to record the temp table.
> >
> > A pythonic analogy would be that a view is like using a generator (where
> > data only needs to be brought forth as it's consumed) while a temp table
> is
> > like using a list. Except there's an even larger difference in SQL: a
> view
> > means the optimizer has a chance to change the execution plan of the
> final
> > query that's using the sessionize output, taking into account everything
> > you're doing with the data.
> >
> > If there are multiple steps in a pipeline, this difference can become
> very
> > large. I've worked on query chains where switching temp tables to views
> has
> > had a 5-10x impact.
> >
> > The two cases where a temp table would be better are if you need to
> refer
> > to the sessionized data many times (sometimes you'd need to hit it more
> > than twice for the temp table to be a win), or if the optimizer ends up
> > picking a bad plan when everything is combined into a single query. In
> the
> > case of a bad plan, it would still be better to either try and tweak the
> > optimizer settings, or to insert an "optimization fence", typically done
> by
> > sticking an OFFSET 0 clause in.
> >
> > To me, the bigger picture is promoting the option of views across all
> > MADlib set returning operations, because when multiple operations are
> > chained together you can see a very large benefit. I suspect that it's
> more
> > common to chain things together, making views a better default... but
> > that's just a guess. So if there was a standard default for this across
> the
> > board, perhaps views would be best.
> > --
> > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> > Experts in Analytics, Data Architecture and PostgreSQL
> > Data in Trouble? Get it in Treble! http://BlueTreble.com
> > 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
> >
>
>

Reply via email to