On 7/19/16 7:36 PM, Frank McQuillan wrote:
"create_view (optional)
BOOLEAN default: TRUE. Determines whether to create a view or materialize a
table as output. If you only needed session info once, creating a view
could be significantly faster than materializing as a table."

Question is:  should it really default TRUE (view) or is it better to
default FALSE (table)? i.e., Or does it really not matter?

tl;dr: it depends, but more importantly I think MADlib should promote views as at least an option (if not the default) across the board.

It's going to depend heavily on what you're doing.

If you're building a "pipeline" of operations where sessionization is just the first step of several, and the session data is only referred to once, creating a view gives the planner a lot more flexibility on how to produce output.

When you use a temp table, the planner has no choice: it must not only materialize the complete result set immediately, but it also has to modify the catalog to record the temp table.

A pythonic analogy would be that a view is like using a generator (where data only needs to be brought forth as it's consumed) while a temp table is like using a list. Except there's an even larger difference in SQL: a view means the optimizer has a chance to change the execution plan of the final query that's using the sessionize output, taking into account everything you're doing with the data.

If there are multiple steps in a pipeline, this difference can become very large. I've worked on query chains where switching temp tables to views has had a 5-10x impact.

The two cases where a temp table would be better are if you need to refer to the sessionized data many times (sometimes you'd need to hit it more than twice for the temp table to be a win), or if the optimizer ends up picking a bad plan when everything is combined into a single query. In the case of a bad plan, it would still be better to either try and tweak the optimizer settings, or to insert an "optimization fence", typically done by sticking an OFFSET 0 clause in.

To me, the bigger picture is promoting the option of views across all MADlib set returning operations, because when multiple operations are chained together you can see a very large benefit. I suspect that it's more common to chain things together, making views a better default... but that's just a guess. So if there was a standard default for this across the board, perhaps views would be best.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461

Reply via email to