Hello Simon,

Simon Riggs wrote:
I've come
up with an alternative concept to allow us to discuss the particular
merits of each. ISTM that this new proposal has considerable potential.

Hm.. interesting idea.

If we were able to keep track of which sections of a table are now
read-only then we would be able to record information about the data in
that section and use it to help solve queries. This is turning the
current thinking on its head: we could derive the constraints from the
data, rather than placing the data according to the constraints. That
would be much more natural: load data into the table and have the system
work out the rest.

Yeah, but that's also the most limiting factor of your approach: it covers only horizontal partitioning by time (or, more precisely, by columns which are very likely to increase or decrease with time). All other columns will very likely contain values from the full range of possible values in every segment.

As you have pointed out, that might be a very frequent use case. I can't argue with that; however, I think it's important to be well aware of that limitation.
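
Just to make sure I understand the mechanism, here's how I picture the exclusion test, as a minimal Python sketch (all names here are mine, nothing is from your proposal):

from dataclasses import dataclass

@dataclass
class Segment:
    rows: list      # tuples in this segment (one keyed column, simplified)
    min_val: int    # per-segment minimum, recorded once the segment is read-only
    max_val: int    # per-segment maximum, ditto

def seq_scan(segments, lo, hi):
    """Return values in [lo, hi]; segments whose min/max range cannot
    intersect the predicate are skipped without any I/O."""
    result = []
    for seg in segments:
        if seg.max_val < lo or seg.min_val > hi:
            continue    # segment exclusion kicks in here
        result.extend(v for v in seg.rows if lo <= v <= hi)
    return result

# A time-keyed table: segments cover narrow, mostly disjoint ranges.
segs = [Segment(list(range(0, 100)), 0, 99),
        Segment(list(range(100, 200)), 100, 199),
        Segment(list(range(200, 300)), 200, 299)]
assert seq_scan(segs, 120, 130) == list(range(120, 131))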

Other scan types might also use segment exclusion, though this would
only be useful for scans retrieving many rows, otherwise the overhead of
segment exclusion might not be worthwhile.

Uh.. the overhead of checking against min/max values doesn't seem that big to me.

I rather think the gain for index scans would be negligibly small, because (given frequent enough vacuuming) an index scan shouldn't return many pointers to tuples in segments which could be optimized away by segment exclusion.

If we collect data for all columns then many of our implicit constraints
would be useless. e.g. if a column only has a few values and these are
present in all segments. Matching our predicate against all constraints
would be expensive, so we must weed out poor constraints. We would do
this by removing any constraint that overlapped more than 10% of other
segments. Various heuristics would likely need to be discovered during
development to make this work without resorting to manual commands.

Uh, well, that's about the limitation I've pointed out above. But is it really worth maintaining statistics about overlapping values and removing min/max checks for certain columns?

It would save you the min/max check per segment and scan, but cost maintaining the statistics and checking against the statistics once per scan. AFAICS the block with the min/max tuple per segment will often remain cached anyway... dunno.
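
For the sake of the discussion, this is roughly how I read the 10% heuristic; just a sketch with made-up names, not a claim about the real bookkeeping:

def ranges_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def useful_constraints(ranges, threshold=0.10):
    """ranges: list of (min, max) per segment for one column.
    Returns indexes of segments whose constraint overlaps at most
    `threshold` of the other segments."""
    keep = []
    for i, r in enumerate(ranges):
        others = [s for j, s in enumerate(ranges) if j != i]
        overlapping = sum(ranges_overlap(r, s) for s in others)
        if overlapping <= threshold * len(others):
            keep.append(i)
    return keep

# A date-like column partitions cleanly; a low-cardinality column does not.
dates  = [(0, 9), (10, 19), (20, 29), (30, 39)]
status = [(1, 3), (1, 3), (1, 3), (1, 3)]     # few distinct values everywhere
print(useful_constraints(dates))   # -> [0, 1, 2, 3]
print(useful_constraints(status))  # -> []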

Noting which segments are read-only
-----------------------------------

Everything so far has relied upon our ability to note which segments of
a table are read-only. We could do this in two ways:

1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to "mark" a segment as read-only now

Probably a combination of the two is better, so we have three states for
segments
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)

Having said that, I want to concentrate on (1), though I'll also consider
the other in case reviewers request it.
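
For reference, the three states as I picture them; note that refusing writes on MARKED_READ_ONLY segments is my assumption, the proposal doesn't spell that out:

from enum import Enum, auto

class SegState(Enum):
    READ_WRITE_ALLOWED  = auto()  # default: backends may insert/update/delete
    EFFECTIVE_READ_ONLY = auto()  # set automatically by the system (option 1)
    MARKED_READ_ONLY    = auto()  # set explicitly by the DBA (option 2)

def on_write(state):
    """A write demotes the automatically derived state; an explicitly
    marked segment (presumably) refuses the write outright."""
    if state is SegState.MARKED_READ_ONLY:
        raise PermissionError("segment is marked read-only")
    return SegState.READ_WRITE_ALLOWED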

Hm.. AFAICT, horizontal partitioning often serves manageability, which is quite limited when all data lives in one table (you can't move a single segment to a different tablespace). Thus I think option 2 is pretty constrained in usability. What would the DBA gain by setting a segment to read-only? And how does the DBA figure out which segment a tuple is stored in (so she can decide to mark it read-only)?

The user may also wish to clear down very old data, so allowing DELETEs
can ensure the user can still remove old data from the table. By
carefully choosing the values to be deleted, a whole segment can be
deleted and then returned to the FSM.

Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie!
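
To illustrate why I like this: a toy model of the reclaim path, with invented names (the real thing would hand whole segments back to the FSM, of course):

def bulk_delete(segments, fsm, lo, hi):
    """Delete values in [lo, hi]; fully covered segments are handed to
    the free space map (fsm) wholesale, without touching single pages."""
    remaining = []
    for seg in segments:
        if lo <= seg["min"] and seg["max"] <= hi:
            fsm.append(seg)            # entire segment reclaimed at once
        else:
            seg["rows"] = [v for v in seg["rows"] if not (lo <= v <= hi)]
            remaining.append(seg)
    return remaining

segs = [{"rows": list(range(0, 100)),   "min": 0,   "max": 99},
        {"rows": list(range(100, 200)), "min": 100, "max": 199}]
fsm = []
segs = bulk_delete(segs, fsm, 0, 99)   # "clear down very old data"
print(len(fsm))                        # -> 1: first segment returned whole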

This proposal offers many of the advantages of the earlier Visibility
Map proposal, yet without major changes to heap structure. Since the
segment-level visibility map is coarser, it would only be 80% as
effective as the more detailed block-level map. Having said that, the
bookkeeping overheads will also be considerably reduced, so it does seem
a good joint approach. It also allows freezing to be handled fully,
which was a problem with the original visibility map proposal. WAL
logging visibility map changes is also now much easier.

I generally agree, although I'm somewhat dubious about the 80% factor.

My thoughts have been targeted directly at partitioning, yet I have to
admit that this idea overlaps, and in my world view, replaces the
Visibility Map proposal. I very much like the name Visibility Map
though.

I would even say that partitioning is somewhat of a misleading term here, because it normally allows the DBA to decide where to split.

Given that we are operating on segments here, about which the DBA has very limited information and control, I prefer the term "Segment Exclusion". I think of it as an optimization of sequential scans on tables with the above-mentioned characteristics.

If we do need to differentiate between the two proposals, we can refer
to this one as the Segment Visibility Map (SVM).

I'm clearly in favor of distinguishing between the two proposals. SVM is a good name, IMHO.

We can handle select count(*) by scanning the non-100% visible segments
of a table, then adding the stored counts for each segment to get a
final total. Not sure if it's really worth doing, but it does sound like
an added bonus.

Yup, sounds tempting. Although it's contrary to Postgres' position so far. And one could argue that you'd then have to maintain not only count(), but also avg(), sum(), etc. for all tuples in the 100% visible segments.
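
For clarity, the arithmetic as I understand it, with a hypothetical visible_count field kept per segment:

def fast_count(segments):
    total = 0
    for seg in segments:
        if seg["effective_read_only"]:
            total += seg["visible_count"]   # bookkept once, reused for free
        else:
            total += len(seg["rows"])       # must still be scanned
    return total

segs = [{"effective_read_only": True,  "visible_count": 131072, "rows": None},
        {"effective_read_only": False, "visible_count": None,   "rows": [1, 2, 3]}]
print(fast_count(segs))  # -> 131075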

There would be additional complexity in selectivity estimation and plan
costing. The above proposal allows dynamic segment exclusion, which
cannot be assessed at planning time anyway, so suggestions welcome...

Hm.. that looks like a rather bad downside of an executor-only optimization.

Comparison with other Partitioning approaches
---------------------------------------------

Once I realised this was possible in a fairly automatic way, I've tried
hard to keep away from manual overrides, commands and new DDL.

Declarative partitioning is a big overhead, though worth it for large
databases. No overhead is *much* better though.

This approach to partitioning solves the following challenges:
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling

All of the above are going to take considerably longer to do in any of
the other ways I've come up with so far...

I fully agree. But as I tried to point out above, the gains in manageability from Segment Exclusion are also pretty close to zero, so I'd argue it only fulfills part of the needs for general horizontal partitioning.

This technique would be useful for any table with historical data keyed
by date or timestamp. It would also be useful for data where a
time-of-insert component is implicit, such as many major entity tables
where the object ids are assigned by a sequence. e.g. an Orders table
with an OrderId as PK. Once all orders placed in a period have been
shipped/resolved/closed then the segments will be marked read-only.

Agreed. Just a minor note: I find "marked read-only" too strong, as it implies that writing is impossible. I propose speaking of mostly-read segments, or segments optimized for reading, or similar.

It's not really going to change the code path much for small tables, yet
seems likely to work reasonably well for large tables of all shapes and
sizes.

That sounds a bit too optimistic to me. For Segment Exclusion, it takes only *one* tuple to enlarge the min/max range dramatically in either direction. So it's not the overall correlation between column values and storage location that matters, but only the min/max column values. Have you ever checked how well these min/max values correlate with the segment number?
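
Here's the toy experiment behind my worry:

segment_values = list(range(100, 200))   # nicely clustered by time
seg_min, seg_max = min(segment_values), max(segment_values)
print((seg_min, seg_max))                # -> (100, 199): excludable

segment_values.append(5)                 # one late-arriving old row
seg_min, seg_max = min(segment_values), max(segment_values)
print((seg_min, seg_max))                # -> (5, 199): now overlaps almost
                                         #    every other segment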

Pretty much the same argument applies to SVM: an update to just one tuple in a segment is enough to remove the read optimization (the EFFECTIVE_READ_ONLY property) for that segment. The assumption here (that updates happen mostly in newer segments) is not quite the same as above.

Maybe a combination with CLUSTERing would be worthwhile? Or even enforced CLUSTERing for the older segments?

If a segment is being updated, we leave it alone, and maybe never
actually set the visibility map at all. So overall, this idea seems to
cover the main use case well, yet with only minimal impact on the
existing use cases we support.

Yup.

As before, I will maintain this proposal on the PG developer Wiki, once
we get down to detailed design.

Cool.

Thanks for working out yet another great proposal. I hope to have been of help with my questions and remarks. ;-)

Regards

Markus

