Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

2017-04-12 Thread Nicolas Barbier
2017-04-11 Robert Haas :

> There's a nasty trade-off here between XID consumption (and the
> aggressive vacuums it eventually causes) and preserving performance in
> the face of errors - e.g. if you make k = 100,000 you consume 100x
> fewer XIDs than if you make k = 1000, but you also have 100x the work
> to redo (on average) every time you hit an error.

You could make it dynamic: Commit the subtransaction when no error has
been encountered after N lines (N starts out at 1), then double N and
continue. When encountering an error, roll back the current
subtransaction, re-insert all the known-good rows that were rolled back
(plus maybe the erroneous row into a separate table or whatever) in one
new subtransaction and commit; then reset N to 1 and continue
processing the rest of the file.

That would work reasonably well whenever the ratio of erroneous rows
is not extremely high, whether the erroneous rows are all clumped
together, spread out entirely randomly over the file, or a combination
of both.
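
Roughly, as a Python sketch of the control flow (insert_row,
handle_bad_row, BadRow, and the subtransaction calls are made-up
stand-ins, not real APIs):

    class BadRow(Exception):
        """Raised by insert_row() for a line that cannot be inserted."""
        def __init__(self, line):
            self.line = line

    def begin_subtransaction(): pass      # placeholders for real savepoint
    def commit_subtransaction(): pass     # handling; only the control flow
    def rollback_subtransaction(): pass   # matters for this sketch

    def copy_with_retry(lines, insert_row, handle_bad_row):
        i, n = 0, 1                       # position in the file, batch size
        while i < len(lines):
            batch = lines[i:i + n]
            begin_subtransaction()
            try:
                for line in batch:
                    insert_row(line)
            except BadRow as e:
                rollback_subtransaction()
                bad = batch.index(e.line)
                begin_subtransaction()
                for line in batch[:bad]:  # re-insert the known-good rows
                    insert_row(line)
                handle_bad_row(e.line)    # e.g., stash it in a side table
                commit_subtransaction()
                i += bad + 1              # resume right after the bad row
                n = 1                     # back to small batches
            else:
                commit_subtransaction()
                i += len(batch)
                n *= 2                    # no error: grow the batch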

> If the data quality is poor (say, 50% of lines have errors) it's
> almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the
proposed approach. So the docs might need to suggest not to insert a
300 GB file with 50% erroneous lines :-).

Greetings,

Nicolas




Re: [HACKERS] Bitmap index scans use of filters on available columns

2015-11-04 Thread Nicolas Barbier
2015-11-04 Antonin Houska :

> While prefix expression
>
> y like 'abc%'
>
> can be converted to
>
> y >= 'abc'
>
> (see expand_indexqual_opclause()), I'm not sure any kind of expansion is
> possible for '%abc%' which would result in a b-tree searchable condition.

I think the question is not about using the b-tree for checking the
condition, but about retrieving the value of y from the index and using
that to check the condition before fetching the corresponding tuple
from the heap.
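
As a toy sketch of that idea (made-up names, not the actual executor
code; index_entries is assumed to yield the indexed value together with
the heap pointer):

    def bitmap_heap_scan(index_entries, fetch_heap_tuple, filter_qual):
        for indexed_value, heap_pointer in index_entries:
            # Evaluate the filter against the value that is already
            # available in the index entry; only fetch the heap tuple
            # (the expensive part) when the filter passes.
            if filter_qual(indexed_value):
                yield fetch_heap_tuple(heap_pointer)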

Nicolas



Re: [HACKERS] a funnel by any other name

2015-09-17 Thread Nicolas Barbier
2015-09-17 Robert Haas :

> 1. Exchange Bushy
> 2. Exchange Inter-Operator (this is what's currently implemented)
> 3. Exchange Replicate
> 4. Exchange Merge
> 5. Interchange

> 1. ?
> 2. Gather
> 3. Broadcast (sorta)
> 4. Gather Merge
> 5. Redistribute

> 1. Parallel Child
> 2. Parallel Gather
> 3. Parallel Replicate
> 4. Parallel Merge
> 5. Parallel Redistribute

FYI, SQL Server has these in its execution plans:

* Distribute Streams: read from one thread, write to multiple threads
* Repartition Streams: both read and write from/to multiple threads
* Gather Streams: read from multiple threads, write to one thread

Nicolas



Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-16 Thread Nicolas Barbier
2015-09-16 Rod Taylor :

> 2015-09-15 Anastasia Lubennikova :
>
>> - We have a table tbl(f1, f2, f3, f4).
>> - We want to have an unique index on (f1,f2).
>> - We want to have an index on (f1, f2, f3) which allow us to use index for
>> complex "where" clauses.
>
> Can someone write a query where F3 being ordered is a contribution?
>
> If F1 and F2 are unique, adding F3 to a where or order by clause doesn't
> seem to contribute anything.

After thinking about it a bit more, it indeed seems never useful to
have f3 in the internal nodes if it is not part of the columns that
determine the UNIQUE property. It could as well be pushed out of the
internal nodes and only appear in the leaf nodes.

In other words: It seems only useful to have a list of columns that
appear in the internal nodes AND to which the UNIQUE property applies,
plus an additional list of columns whose values are only stored in the
leaf nodes (to create a “covering index”). For non-UNIQUE indexes,
there is likewise only a need for two lists of columns.

I don’t understand the case in which it would be useful anyway, as
described by David:

2015-09-16 David Rowley :

> Joining relations may have more than one matching tuple for any given unique
> tuple, therefore the tuples may no longer be unique on the columns which are
> in the unique index.

Could you elaborate a bit on how this is relevant to Rod’s question? I
seem to be missing something here.

greetings,

Nicolas



Re: [HACKERS][PROPOSAL] Covering + unique indexes.

2015-09-15 Thread Nicolas Barbier
2015-09-15 David Rowley :

> I'm also a bit confused where f3 comes in here. If it's UNIQUE on (f1,f2)
> and we include f4. Where's f3?

Columns f1, f2, f3 are in the internal nodes of the tree (i.e., they
are used to find the ultimate leaf nodes). f4 is only in the leaf
nodes. If the f4 values are typically big and typically not used in
the search predicate, putting f4 in the internal nodes would make the
upper part of the index (which determines how many levels the index
has) larger for no good reason. f4 can still be retrieved without going
to the heap, so including it in the leaf nodes makes it possible to do
index-only scans more often.
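
A toy model of that layout (made-up structures; the internal levels are
reduced to a bisect over the key columns, and the f4 payload lives only
in the leaf entries):

    import bisect

    # Leaf entries: ((f1, f2, f3), f4, heap_pointer). The tree is ordered
    # and searched on (f1, f2, f3) only, so the possibly-big f4 values
    # never bloat the upper (routing) levels of the index.
    leaf_entries = [
        ((1, 'a', 10), 'big payload 1', 'tid-1'),
        ((1, 'b', 20), 'big payload 2', 'tid-2'),
        ((2, 'a', 30), 'big payload 3', 'tid-3'),
    ]
    keys = [key for key, _, _ in leaf_entries]

    def index_only_lookup(key):
        """Return the f4 payload without touching the heap."""
        pos = bisect.bisect_left(keys, key)
        if pos < len(keys) and keys[pos] == key:
            return leaf_entries[pos][1]
        return None

    print(index_only_lookup((2, 'a', 30)))   # prints: big payload 3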

Nicolas



Re: [HACKERS] Eliminating CREATE INDEX comparator TID tie-breaker overhead

2015-07-24 Thread Nicolas Barbier
2015-07-23 Robert Haas robertmh...@gmail.com:

 I think what you should do is go find out whether the second rationale
 is valid or not.

Knowing how much impact such “non TID ordered” entries have on
performance would of course be very useful for future patch authors.
Especially useful would be to know whether interleaving a small number
of TID ordered streams (as would probably be generated by parallel
scans/processing) would result in an ordering that performs
significantly worse or not. I assume (but cannot prove) that in this
case the OS will understand the read pattern as being multiple streams
and prefetching will work correctly.

 I'm not going to endorse the notion that tuplesort.c will only DTRT if
 it receives tuples in TID order; it cannot be the responsibility of
 the caller of the sort code to ensure that the tuples are sorted.

Except that it will do the right thing (as in correctness), but may
not result in the best overall performance possible (for future
queries). I think it is a typical property of “reasons for performance
to be good” that they rely on a lot of otherwise-independent code
working together the right way.

Nicolas



Re: [HACKERS] Eliminating CREATE INDEX comparator TID tie-breaker overhead

2015-07-24 Thread Nicolas Barbier
2015-07-24 Nicolas Barbier nicolas.barb...@gmail.com:

 Especially useful would be to know whether interleaving a small number
 of TID ordered streams (as would probably be generated by parallel
 scans/processing) would result in an ordering that performs
 significantly worse or not. I assume (but cannot prove) that in this
 case the OS will understand the read pattern as being multiple streams
 and prefetching will work correctly.

OTOH, that is probably only true when there are a large number of
duplicate keys. Otherwise the order within each (small) group will
appear random, which may or may not result in a significant
performance drop. This probably also depends on whether fadvise (or
friends) are used.

Nicolas



Re: [HACKERS] Is it possible to have a fast-write Index?

2015-06-18 Thread Nicolas Barbier
2015-06-05 deavid deavidsed...@gmail.com:

 Mode 3: on aminsert, put the new entry on a second btree; leaving the
 first one untouched. Because the second btree is new, will be small, and
 writes should be faster. When doing a index scan, read tuples from both at
 same time (like merge sort). On vacuum, merge the second btree onto the
 first. On this mode, the index is sorted and there's no need of recheck.

You might be interested in reading the thread “Fast insertion indexes:
why no developments” (2013), starting at
1383033222.73186.yahoomail...@web172602.mail.ir2.yahoo.com .

That thread talks mostly about reducing the (delayed) I/O caused by
inserting in a super-big index at a continuously high rate, while you
seem more interested in the delay problem caused by the CPU time used
when inserting in multiple indexes (which should be quite fast
already, as most of the writing is delayed).

Your problem (if it is really about delay and not about continuous
throughput) might be better served by a delay-reducing solution such
as writing a logical “row X is inserted, please make sure that all
indexes are up-to-date” record to the WAL, instead of the physical “row
X is inserted into table A, part Y of index Z must be updated like
this, part Q of index S must be updated like so, etc” records that are
written now. Updating the indexes (not only the writing itself) would
then be performed in a delayed fashion. Reading an index must then
always additionally scan the in-memory queue of logical WAL records
that is kept.
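
A toy in-memory sketch of those read and write paths (a dict standing
in for the real index, a list standing in for the queue of logical
records; none of this matches actual PostgreSQL structures):

    pending = []   # logical "row inserted" records, not yet applied
    index = {}     # the real index: key -> list of row ids

    def insert_row(key, row_id):
        # The commit path only records the logical change; it does not
        # touch the (possibly many, possibly large) indexes.
        pending.append((key, row_id))

    def apply_pending():
        # Done asynchronously by a background process.
        while pending:
            key, row_id = pending.pop(0)
            index.setdefault(key, []).append(row_id)

    def index_lookup(key):
        # A reader must also see rows whose index entries are still queued.
        queued = [rid for (k, rid) in pending if k == key]
        return index.get(key, []) + queued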

Of course, switching the WAL wholesale from a physical description of
the changes that must be performed to a logical description is
probably not feasible. Therefore, one could think about some new kind
of “logical WAL” that gets logged separately (or even mixed in with
the normal, physical WAL), where first the logical WAL is written
(“insert row X in table A”), after which other operations can continue
and the logical WAL is converted to physical WAL asynchronously (by
“performing” the changes as is currently done, but by a background
process). Recovery then would first need to replay the physical WAL,
and then replay the logical WAL.

Nicolas



Re: [HACKERS] Re: Patch to add functionality to specify ORDER BY in CREATE FUNCTION for SRFs

2015-01-07 Thread Nicolas Barbier
2015-01-05 Tom Lane t...@sss.pgh.pa.us:

 What would make sense to me is to teach the planner about inlining
 SQL functions that include ORDER BY clauses, so that the performance
 issue of a double sort could be avoided entirely transparently to
 the user.

Another way of getting to the point where the extra check-node is not
needed in obvious cases, would be:

* Apply the current patch in some form.
* Later, add code that analyzes the query inside the function. If it
turns out that the result of the analysis implies the declared order,
don't add the check-node.

The analysis can in principle also be performed for other languages,
but that would most likely be way more complex for the typical Turing
complete languages.

Nicolas



Re: [HACKERS] Materialized views don't show up in information_schema

2014-10-18 Thread Nicolas Barbier
2014-10-18 Stephen Frost sfr...@snowman.net:

 * Peter Eisentraut (pete...@gmx.net) wrote:

 More subtly, if we claim that a materialized view is a view, then we
 cannot have asynchronously updated materialized views, because then we
 have different semantics.

 This is, at least, a reason I can understand, though I'm not sure I see
 it as sufficient to say matviews are so different from views that they
 shouldn't be listed as such.

Maybe it's useful to try to imagine who the actual consumers of
INFORMATION_SCHEMA are? It's probably mostly generic SQL tools that
try to provide a list of objects with corresponding operations (drop,
alter, dump to a file, show contents, etc) or tools that provide
syntax completion.

I can only imagine two practical (though rather hypothetical) problems
caused by the semantic mismatch between normal views and
possibly-outdated matviews:

(1) Accidentally seeing outdated data: Something tries to replace part
of a query with a reference to a matview, because INFORMATION_SCHEMA
says that the definition of the view is such-and-so. This causes the
resulting query to possibly see outdated data.

(2) Accidentally seeing data that is too new: Something replaces a
reference to a matview in a query with the defining query of the
matview. This causes the resulting query to possibly see data that is
too new, assuming that the original query author is trying to rely
on the outdatedness of the matview in the vein of “I want to see
yesterday’s data.” I personally consider relying on the outdatedness
of a matview to be bad design; Maybe that should be mentioned in the
documentation if I'm not the only one thinking that way.

(Note that (2) also happens when a generic SQL tool regenerates a
schema by recreating a matview as a normal view. The resulting normal
view seems to contain data that is too new.)

Those problems sound so far-fetched that I suggest putting matviews
(even though they may be out of date) in INFORMATION_SCHEMA.VIEWS as
if they were normal views, so that in all other use cases (i.e., the
abovementioned generic SQL tools), the right thing happens. It is
probably useful to put them in INFORMATION_SCHEMA.TABLES with a
specialized type MATERIALIZED VIEW (or somesuch), so that tools that
know about the existence of matviews can tell the difference.

Does someone know what other DBMSs do in this regard? I.e., do they
put anything in INFORMATION_SCHEMA.VIEWS for matviews? What TABLE_TYPE
do they use in INFORMATION_SCHEMA.TABLES?

Nicolas



Re: [HACKERS] Materialized views don't show up in information_schema

2014-10-17 Thread Nicolas Barbier
2014-10-16 Stephen Frost sfr...@snowman.net:

 Alright, coming back to this, I have to ask- how are matviews different
 from views from the SQL standard's perspective?

Matviews that are always up to date when you access them are
semantically exactly the same as normal views. Matviews that can get
out of date, however, are not.

Nicolas



Re: [HACKERS] Minmax indexes

2014-08-07 Thread Nicolas Barbier
2014-08-07 Oleg Bartunov obartu...@gmail.com:

 +1 for BRIN !

+1, rolls off the tongue smoothly and captures the essence :-).

Nicolas



Re: [HACKERS] Minmax indexes

2014-08-06 Thread Nicolas Barbier
2014-08-06 Claudio Freire klaussfre...@gmail.com:

 So, I like blockfilter a lot. I change my vote to blockfilter ;)

+1 for blockfilter, because it stresses the fact that the physical
arrangement of rows in blocks matters for this index.

Nicolas



Re: [HACKERS] 9.4 release notes

2014-05-06 Thread Nicolas Barbier
2014-05-05 Bruce Momjian br...@momjian.us:

 On Mon, May  5, 2014 at 10:40:29AM -0700, Josh Berkus wrote:

 * ALTER SYSTEM SET

 Lemme know if you need description text for any of the above.

 OK, great!  Once I have the markup done, I will beef up the descriptions
 if needed and copy the text up to the major items section so we have
 that all ready for beta.

“Add SQL-level command ALTER SYSTEM command [..]”

Using “command” twice sounds weird to my ears. Wouldn’t leaving out
the second instance be better?

Nicolas



Re: [HACKERS] Patch: iff - if

2014-04-17 Thread Nicolas Barbier
2014-04-17 Michael Paquier michael.paqu...@gmail.com:

 Is there no equivalent in German? For example in French there is ssi.

gdw (genau dann, wenn)

Nicolas



Re: [HACKERS] Proposal for Merge Join for Non '=' Operators

2014-04-09 Thread Nicolas Barbier
2014-04-09 Dilip kumar dilip.ku...@huawei.com:

 I would like to propose a New merge join algorithm for optimizing non ‘=’
 operators. (‘<’, ‘<=’, ‘>’, ‘>=’)

Do you have a real-world example use case of such joins, to offset the
extra planner time that will likely have to be paid (even for queries
for which the functionality ends up not being used)?

I guess there might be queries that join on “values that are not too
far apart” or something, but as those cases (there typically not being
a lot of “inner” rows that join with each “outer” row) are probably
executed efficiently using a nested loop + index scan, I don’t see the
point (yet). Are you aiming for the case where the inner relation is
difficult to compute and cannot be accessed using an index scan?
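
For reference, this is how I picture such a join working on sorted
inputs, for the “<” case (a toy Python sketch; whether it matches the
proposed algorithm, I don’t know):

    def merge_join_less_than(outer, inner):
        """Yield all (x, y) pairs with x < y."""
        inner = sorted(inner)
        start = 0
        for x in sorted(outer):
            # Advance the frontier: inner values at or below x can never
            # match this or any later (larger) outer value.
            while start < len(inner) and inner[start] <= x:
                start += 1
            # Everything beyond the frontier matches, which is exactly
            # why the result (and the work) can easily explode.
            for y in inner[start:]:
                yield (x, y)

    print(list(merge_join_less_than([1, 3], [2, 4])))
    # prints: [(1, 2), (1, 4), (3, 4)]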

 selecting this new cost calculation can be implemented in planner

Hmm. Of course, the difficult part will be adding support for this in
the planner, but you don’t seem to have any plans for implementing
that?

Nicolas



Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-02 Thread Nicolas Barbier
2013/12/15 David Rowley dgrowle...@gmail.com:

 I've been working on speeding up aggregate functions when used in the
 context of a window's with non fixed frame heads.

 1. Fully implement negative transition functions for SUM and AVG.

I would like to mention that this functionality is also extremely
useful to have for the incremental maintenance of materialized views
that use aggregation (which IMHO is one of the most useful kinds).

(Simply imagine a view of the form “SELECT a, agg_function(b) FROM
table GROUP BY a”, a lot of rows in the table, a lot of rows in each
group, and changes that both remove and add new rows.)

For this to work, two things are needed:

(1) A way to apply a value normally (already supported) and inversely
(i.e., this patch) to the current “internal state” of an aggregation.

(2) A way to store the “internal state” of an aggregation in the
materialized view’s “extent” (i.e., the physical rows that represent
the view’s contents, which may or may not be slightly different from
what you get when you do SELECT * FROM matview). As (AFAIK) that state
is stored as a normal value, the maintenance code could just take the
value, store it in the extent, and next time retrieve it again and
perform normal or inverse transitions. When selecting from the
matview, the state could be retrieved, and the final function applied
to it to yield the value to be returned.

To understand (2), assume that one wants to store an AVG() in a
materialized view; To be able to update the value incrementally, one
needs to actually store the SUM() and COUNT(), and perform the
division when selecting from the materialized view. Or it could
(initially) be decided to define AVG() as “not supporting fast
incremental maintenance,” and require the user (if he/she wants fast
incremental maintenance) to put SUM() and COUNT() in the materialized
view manually, and perform the division manually when wanting to
retrieve the average.
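
To make (1) and (2) concrete for AVG, a toy sketch of the per-group
state and its transitions (made-up names; this does not correspond to
any actual aggregate API):

    class AvgState:
        """Internal state stored in the matview's extent for one group."""
        def __init__(self):
            self.sum = 0
            self.count = 0

        def forward(self, value):    # a row was added to the group
            self.sum += value
            self.count += 1

        def inverse(self, value):    # a row was removed from the group
            self.sum -= value
            self.count -= 1

        def final(self):             # applied when selecting from the view
            return self.sum / self.count if self.count else None

    s = AvgState()
    for v in (10, 20, 30):
        s.forward(v)
    s.inverse(20)                    # maintain incrementally, no rescan
    print(s.final())                 # prints: 20.0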

Nicolas



Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Nicolas Barbier
2013/11/12 Claudio Freire klaussfre...@gmail.com:

 On Tue, Nov 12, 2013 at 6:41 PM, Nicolas Barbier
 nicolas.barb...@gmail.com wrote:

 (Note that K B-trees can be merged by simply scanning all of them
 concurrently, and merging them just like a merge sort merges runs.
 Also, all B-trees except for the first level (of size S) can be
 compacted 100% as there is no need to reserve space for further
 insertions in them.)

 Unless you can guarantee strong correlation of index-order vs
 physical-order, scanning multiple indexes in index-order will be quite
 slow (random I/O).

As all the bigger trees are written in one pass (as the result of a
merge of multiple smaller trees), I would assume that it is rather
easy to guarantee it for them.

As for the smallest trees (size S), I think it doesn’t matter much as
they “fit easily in memory.” Initially I would say that redefining it
so that K of them (rather than 1) must still fit in memory is the easy
fix.

A future optimization could alleviate the need for the redefinition
(and would also improve normal B-tree indexes): Somehow make sure that
smaller trees (that fit in memory) are typically written out
more-or-less in the right order. For that, one could for example
postpone determining the ultimate block-order to write-out time. This
is similar to what Reiser4 calls “dancing trees,” but unfortunately
requires some rejiggering of the abstraction layers on PostgreSQL (I
think). Having deferred insertion (which is probably way easier to
implement) could conceivably also improve things.

URL:https://en.wikipedia.org/wiki/Dancing_tree

Nicolas



Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-13 Thread Nicolas Barbier
2013/11/12 Simon Riggs si...@2ndquadrant.com:

 On 12 November 2013 21:41, Nicolas Barbier nicolas.barb...@gmail.com wrote:

 Look-up speed is as follows: Each look-up must look through all
 B-trees.

 That can be optimised by using a min max approach, so we need only
 look at sub-trees that may contain data.

Under the assumption that the data are really *really* inserted in
random order, I think that min max indexes would not help much: All
subtrees would contain so many different values that the boundaries
will almost never be narrow enough to exclude scanning them (unless
one is searching for outliers).

I think that min max indexes are very useful for data that is inserted
in some less-than-random way: When rather large swathes of inserted
data fall in between boundaries that a lot of other data doesn’t fall
in between. For min max index scans to be fast, the data doesn’t
exactly have to be ordered in the heap (that’s one of the advantages
vs. B-trees, where a different order in the index and the heap is very
bad for scans of any significant part of the index), but it can also
not be entirely arbitrary.

I would say that min max indexes should be used in either of the
following cases (and probably some more that I don’t think of right
now):

* When the data is inserted close to ordered (i.e., past or future
dates that have a tendency to be correlated with “the current day,”
such as invoice dates). For this usecase, a normal B-tree would also
work, but would take more space and require more heavy maintenance.
* When large batches of “similar” data (fitting in between boundaries
that are more narrow than the “global boundaries”) are inserted at a
time, even when multiple of such batches arrive in random order.
* When insertion is up to entirely random, but the goal is to optimize
look-ups for (relatively rare) outliers.
* (Combined with any of the above, to actually make the index
*useful*:) When needing a lot of indexes on the same data or having a
very high rate of insertion, so that the maintenance burden matters a
lot.

 I would add that it is possible to optimise large DELETEs from a table
 if complete sub-trees of the btree can be easily removed. This for me
 would be the compelling benefit of this approach.

Idem WRT the randomness: If the data are really deleted in a
completely random fashion (e.g., Leonardo might want to delete all
phone calls in a certain year, while the index is on phone number),
whole subtrees would almost never become candidates for deletion (the
same problem applies to a normal min max index). Of course, everything
would change if the data is not really deleted randomly.

The same usecases as mentioned above (replace “insertion” with
“deletion”) seem to apply for deletion in min max indexes.

Note that B-forests as described before don’t work well for a high (or
even not-so-high) rate of deletions: During VACUUM the updates to the
bigger trees would kill performance, similar to how a high rate of
insertion kills the performance of normal B-trees once they get big.
To remedy this, one could add stuff such as “B-trees of deleted
entries” (i.e., negative trees) that may then afterwards be merged
with other such “negative” trees + “positive” trees. Look-ups would
need to take all those trees into account.

Nicolas



Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-12 Thread Nicolas Barbier
2013/11/2 Simon Riggs si...@2ndquadrant.com:

 On 29 October 2013 16:10, Peter Geoghegan p...@heroku.com wrote:

 Presumably someone will get around to implementing a btree index
 insertion buffer one day. I think that would be a particularly
 compelling optimization for us, because we could avoid ever inserting
 index tuples that are already dead when the deferred insertion
 actually occurs.

 That's pretty much what the LSM-tree is.

[ Disclaimer: I have only skimmed the LSM-trees paper rather than read
it thoroughly. ]

LSM-trees seem to hit a wall when the total amount of data gets big
enough, unless one uses “multi-component LSM-trees” (as described in
the paper). Having a B-tree with deferred insertion would be similar
to having an LSM-tree without the multi-component property.

The underlying problem with fast random insertions into a B-tree is
that each write of a whole block writes only a small amount of “useful
changes” (once the tree gets large enough relative to memory). The
“multi-component” concept fixes that. I think that simply combining
that concept with normal B-trees is a rather elegant way of (at least
theoretically) solving the problem:

Definitions:

* Define a rather small integer K (with K ≥ 2), that will influence
maximum insertion speed (higher K = higher insertion speed), but also
look-up time (higher K = slower look-up).
* Define some size S “that easily fits in memory.”
* F is the fan-out of the B-tree. (If F < K, the algorithm results in
slower amortized insertion speed than simply using one single big
B-tree if only the amount of blocks read/written is taken into
account; it may still be better because of seek times.)
* n is the total number of entries in the index.

Algorithm:

* Start inserting into a small B-tree until it gets size S, then start
inserting into a new B-tree until that fills up to size S, etc.
* When K such B-trees (of size S) exist, merge them together into one
(S × K)-sized B-tree (delete the old ones).
* Do this recursively: Once K B-trees of size (K × S) exist, merge
them together into one (S × K^2)-sized B-tree, etc.
* Let each look-up walk all trees, and return the union of all results.

(Note that K B-trees can be merged by simply scanning all of them
concurrently, and merging them just like a merge sort merges runs.
Also, all B-trees except for the first level (of size S) can be
compacted 100% as there is no need to reserve space for further
insertions in them.)
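
A toy Python sketch of the whole scheme (sorted lists stand in for the
B-trees; made-up names, and no attempt at on-disk layout):

    import heapq

    S = 4          # size of the smallest trees (“fits easily in memory”)
    K = 2          # number of same-sized trees that triggers a merge

    forest = {}    # level -> list of sorted runs (stand-ins for B-trees)
    current = []   # the small in-memory tree currently being filled

    def insert(key):
        current.append(key)
        if len(current) == S:
            _add_run(0, sorted(current))
            current.clear()

    def _add_run(level, run):
        runs = forest.setdefault(level, [])
        runs.append(run)
        if len(runs) == K:
            # Merge K same-sized trees into one K-times-larger tree,
            # just like a merge sort merges runs; recurse upwards.
            merged = list(heapq.merge(*runs))
            forest[level] = []
            _add_run(level + 1, merged)

    def lookup(key):
        # Each look-up must consult every tree (plus the in-memory one).
        if key in current:
            return True
        return any(key in run for runs in forest.values() for run in runs)

    for k in [5, 3, 8, 1, 9, 2, 7, 6, 0, 4]:
        insert(k)
    print(lookup(7), lookup(42))   # prints: True False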

Insertion speed can be calculated as follows (I disregard seek times
to make this easy; it also happens to be the right assumption for
SSDs): Assume that a small B-tree (size S) can simply be written out
without having to write each block multiple times. Each entry has to
be written log_K(n × log_F(n)) times. All blocks written are 100%
“useful changes.” Therefore, insertion speed is log_K(n × log_F(n))
times slower than raw disk speed. (Note that I also disregard the time
needed for reading; This may make everything about twice as slow.)

Example: For K = 10, F = 100 (i.e., 80B per entry), blocksize = 8kB,
and n = 10^9 (i.e., 80GB of entries), the insertion speed is
log_10(10^9 × log_100(10^9)) = log_10(10^9 × 4.5) = ~9.5 times slower
than writing the 80GB of index entries sequentially. For the “one
single big B-tree” scenario, the insertion speed would be ~F = ~100
times slower than raw disk speed (assuming that all writes but the
ones to the leaf-level can be combined).

Look-up speed is as follows: Each look-up must look through all
B-trees. Assume for simplicity that all trees have the same height as
the single B-tree in the “one single big B-tree” case (which is rather
wrong (most are less tall), but seems good enough as an estimate), that
there are K trees of each size (idem), and that there are log_K(n)
different sizes. Then, the number of trees to walk is K × log_K(n),
and therefore each look-up is K × log_K(n) times slower than in the
“one single big B-tree” case.

Example: (using the same parameters as above) Look-up speed is 10 ×
log_10(10^9) = 90 times slower (i.e., because there are ~90 trees).
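
Both example figures can be re-derived mechanically (Python, just
redoing the arithmetic above):

    from math import log

    n, K, F = 10**9, 10, 100

    insert_slowdown = log(n * log(n, F), K)   # ≈ 9.65: the “~9.5” above
    lookup_slowdown = K * log(n, K)           # = 90: the trees to walk
    print(insert_slowdown, lookup_slowdown)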

Index size: I think (didn’t calculate) that the combined size of the
B-trees will be about the same as (a little bit more than) the size of
a single big B-tree containing the same entries.

I have no idea whether someone already gave this kind of “forest of
B-trees” structure a name. Otherwise, I suggest “B-forest” :-).

In conclusion, use a “B-forest” when:

* The index entries are small (large fan-out).
* The insertion throughput is high.
* It’s OK for look-ups to be slow.
* Extra points when the storage medium has high seek times.

Major missing piece in PostgreSQL (I think):

* Functionality to merge K indexes into one (bigger) combined index.

Nicolas


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-12 Thread Nicolas Barbier
2013/11/12 Nicolas Barbier nicolas.barb...@gmail.com:

 In conclusion, use a “B-forest” when:

 * The index entries are small (large fan-out).
 * The insertion throughput is high.
 * It’s OK for look-ups to be slow.
 * Extra points when the storage medium has high seek times.

Oops, forgot the most important ones:

* Insertions are random.
* The total amount of data is very high.

Nicolas



Re: [HACKERS] Chinese in Postgres

2013-08-18 Thread Nicolas Barbier
[ Could you please trim your citations, i.e., please don’t top-post:
https://en.wikipedia.org/wiki/Posting_style#Top-posting ]

2013/8/16 Francesco ciifrance...@tiscali.it:

 Thanks for your answer.
 Yes, the client is also UTF8:

 MyDB=# show client_encoding;
  client_encoding
 -----------------
  UTF8
 (1 row)

I guess that this is the client encoding used by psql. I suspect your
C++-program doesn’t use client encoding UTF8. What library are you
using, libpq? Did you run the psql instance (whose output you pasted)
on Windows or on some kind of UNIX-machine over SSH? Does your
problematic C++-program run on Windows or the UNIX-machine?

(The “client encoding” is not a property of the database, but of the
specific client you are using. The C++-program’s client encoding might
therefore be entirely different from the one used by psql, especially
if you don’t run them on the same machine.)
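
(The problematic program is C++, but just to illustrate that this is a
per-connection setting and not a database property, the same check from
a Python/psycopg2 client would look like this; connection parameters
are made up:)

    import psycopg2

    conn = psycopg2.connect(dbname="MyDB")    # adjust credentials as needed
    cur = conn.cursor()
    cur.execute("SHOW client_encoding")
    print(cur.fetchone()[0])          # whatever *this* client negotiated

    conn.set_client_encoding("UTF8")  # changes it for this connection only
    cur.execute("SHOW client_encoding")
    print(cur.fetchone()[0])          # now UTF8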

[ BTW, I think this question really doesn’t belong on -hackers, as
no-one seems to think it is a bug, nor is it a question about
PostgreSQL internals. ]

Nicolas



Re: [HACKERS] Documentation/help for materialized and recursive views

2013-06-27 Thread Nicolas Barbier
2013/6/27 Magnus Hagander mag...@hagander.net:

 Is there a particular reason why CREATE RECURSIVE VIEW is part of the
 help for CREATE VIEW, but CREATE MATERIALIZED VIEW doesn't show up
 there?

 I realize the technical reason (they're different man pages, and that
 also controls what's in \h in psql which is where I ran into it), but
 was there any particular reason to split those up in the first place?

Normal views are an abstraction layer, while materialized views
(despite containing the word “view”) are mainly a performance tool (in
a way similar to indexes).

The functionality of materialized views will (over time) totally swamp
that of normal views, so mixing all the corresponding documentation
with the documentation for normal views probably doesn’t make things
easier for people that are only interested in normal views.

Nicolas



Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Nicolas Barbier
2013/6/26 Heikki Linnakangas hlinnakan...@vmware.com:

 On 26.06.2013 16:41, Yuri Levinsky wrote:

 Heikki,
 As far as I understand the height of the btree will affect the number of
 I/Os necessary. The height of the tree does not increase linearly with
 the number of records.

 Now let's compare that with a hash partitioned table, with 1000 partitions
 and a b-tree index on every partition. [..] This is almost equivalent to
 just having a b-tree that's one level taller [..] There certainly isn't
 any difference in the number of actual I/O performed.

Imagine that there are a lot of indexes, e.g., 50. Although a lookup
(walking one index) is equally fast, an insertion must update all 50
indexes. When each index requires one extra I/O (because each index is
one level taller), that is 50 extra I/Os. In the partitioned case,
each index would require the normal smaller amount of I/Os. Choosing
which partition to use must only be done once: The result “counts” for
all indexes that are to be updated.

Additionally: Imagine that the data can be partitioned along some
column that makes sense for performance reasons (e.g., some “date”
where most accesses are concentrated on rows with more recent dates).
The other indexes will probably not have such a performance
distribution. Using those other indexes (both for look-ups and
updates) in the non-partitioned case, will therefore pull a huge
portion of each index into cache (because of the “random distribution”
of the non-date data). In the partitioned case, more cache can be
spent on the indexes that correspond to the “hot partitions.”

Nicolas



Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Nicolas Barbier
2013/6/27 Nicolas Barbier nicolas.barb...@gmail.com:

 When each index requires one extra I/O (because each index is
 one level taller), that is 50 extra I/Os. In the partitioned case,
 each index would require the normal smaller amount of I/Os.

[..]

 Using those other indexes (both for look-ups and
 updates) in the non-partitioned case, will therefore pull a huge
 portion of each index into cache (because of the “random distribution”
 of the non-date data). In the partitioned case, more cache can be
 spent on the indexes that correspond to the “hot partitions.”

It seems that the system described by Claudio fixes this problem another way:

Claudio wrote:

 Now I just have two indices. One that indexes only hot tuples, it's
 very heavily queried and works blazingly fast, and one that indexes by
 (hotness, key).

Yuri, maybe that is something you should investigate instead of partitioning?

Nicolas



Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Nicolas Barbier
2013/6/27 Markus Wanner mar...@bluegap.ch:

 On 06/27/2013 11:12 AM, Nicolas Barbier wrote:

 Imagine that there are a lot of indexes, e.g., 50. Although a lookup
 (walking one index) is equally fast, an insertion must update al 50
 indexes. When each index requires one extra I/O (because each index is
 one level taller), that is 50 extra I/Os. In the partitioned case,
 each index would require the normal smaller amount of I/Os. Choosing
 which partition to use must only be done once: The result “counts” for
 all indexes that are to be updated.

 I think you're underestimating the cost of partitioning. After all, the
 lookup of what index to update for a given partition is a a lookup in
 pg_index via pg_index_indrelid_index - a btree index.

I am assuming that this (comparatively very small and super-hot) index
is cached all the time, while for the other indexes (that are
supposedly super-huge) only the top part stays cached.

I am mostly just trying to find out where Yuri’s “partitioning is
needed for super-huge tables” experience might come from, and noting
that Heikki’s argument might not be 100% valid. I think that the
“PostgreSQL-answer” to this problem is to somehow cluster the data on
the “hotness column” (so that all hot rows are close to one another,
thereby improving the efficiency of the caching of relation blocks) +
partial indexes for the hot rows (as first mentioned by Claudio; to
improve the efficiency of the caching of index blocks).

 Additionally, the depth of an index doesn't directly translate to the
 number of I/O writes per insert (or delete). I'd rather expect the avg.
 number of I/O writes per insert into a b-tree to be reasonably close to
 one - depending mostly on the number of keys per page, not depth.

My reasoning was: To determine which index block to update (typically
one in both the partitioned and non-partitioned cases), one needs to
walk the index first, which supposedly causes one additional (read)
I/O in the non-partitioned case on average, because there is one extra
level and the lower part of the index is not cached (because of the
size of the index). I think that pokes a hole in Heikki’s argument of
“it really doesn’t matter, partitioning == using one big table with
big non-partial indexes.”

Nicolas



Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Nicolas Barbier
2013/6/17 Heikki Linnakangas hlinnakan...@vmware.com:

 +errmsg(concurrent refresh requires a
 unique index on just columns for all rows of the materialized view)));

Maybe my English is failing me here, but I don’t understand the “just” part.

Nicolas



Re: [HACKERS] counting algorithm for incremental matview maintenance

2013-05-17 Thread Nicolas Barbier
2013/5/17 Kevin Grittner kgri...@ymail.com:

 During calculation of the deltas to apply to the matviews, it must
 be possible to query the referenced tables from the perspective of
 both the before and after versions of the data.

[..]

 I don't think the process applying the deltas needs to do
 anything unusual or special about acquiring snapshots as it
 processes the delta from each transaction, unless we see some way
 to optimize things once we're hip-deep in the details.

Note that the basic count algorithm assumes real-serializable
transactions for correctness. Example:

* The example matview’s defining query is a simple join: SELECT * FROM
A JOIN B ON A.a = B.b.
* For simplicity, assume that the matview is initially empty.
* Assume a non real-serializable isolation mode, e.g., REPEATABLE READ.
* Assume that there are two transactions T1 and T2 running in
parallel, that don’t see each other’s changes.
* Both T1 and T2 insert a row in A and B, respectively, in such a way
that the combination of the rows is matched by the join condition
(i.e., the row inserted by T1 has A.a equal to the B.b of the row
inserted by T2).
* The count algorithm will not add anything to the matview as part of
the execution of T1 (because the new row in A doesn’t join with any
rows in B that T1 observes).
* Idem for T2 (nothing will be added).
* After T1 and T2 completed, a new transaction will see an
inconsistency: A and B contain rows that are supposed to result in a
row in the matview, but it doesn’t contain any.
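
A toy simulation of that anomaly (each transaction computes its matview
delta against its own snapshot, which is how I understand the count
algorithm would behave under REPEATABLE READ; names are made up):

    # Snapshots taken at the start of T1 and T2: both see empty tables.
    snapshot_A, snapshot_B = [], []

    # T1 inserts a row into A, T2 inserts a matching row into B.
    t1_new_A = [1]
    t2_new_B = [1]

    # Delta computed by T1: join its new A rows against B *as T1 sees it*.
    delta_t1 = [(a, b) for a in t1_new_A for b in snapshot_B if a == b]
    # Delta computed by T2: join its new B rows against A *as T2 sees it*.
    delta_t2 = [(a, b) for b in t2_new_B for a in snapshot_A if a == b]

    matview = delta_t1 + delta_t2
    print(matview)   # prints: []  -- but A JOIN B now contains (1, 1)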

Each of the following would fix this problem:

(1) Real-serializability must be required for all transactions that
update any base table of a matview that is maintained using the
count algorithm.
(2) The matview’s defining query and the corresponding base tables
must adhere to certain (very commonly satisfied) conditions (for
example, if there is an FK between A.a and B.b, the example cannot be
made to fail; The FK must not be deferred in the case of
right-after-each-statement matview maintenance).
(3) The count algorithm must be implemented in a way that understands
MVCC internals: Reading the base tables must be done using a technique
that reads all rows (i.e., also the ones not visible to the current
transaction), and the xmin/xmax/etc of the updated rows in the matview
must be set in a smart way, so as to yield visibility results that
correspond to the visibility of the “originating” rows in the base
tables. Calculation of the matview deltas (especially the part where
the base tables are inspected for joining rows) must in this case
probably be done in a serialized manner.

Nicolas



Re: [HACKERS] counting algorithm for incremental matview maintenance

2013-05-17 Thread Nicolas Barbier
2013/5/17 Kevin Grittner kgri...@ymail.com:

 Nicolas Barbier nicolas.barb...@gmail.com wrote:

 Note that the basic count algorithm assumes real-serializable
 transactions for correctness. Example:

[..]

 Good point.

 It might be hard to detect when this type of race condition exists,
 since it could be hidden inside a function.  Any thoughts on that?

I think that matviews with defining queries that use functions for
which the system can not prove whether they are safe for the given
incremental maintenance technique (e.g., count algorithm) and
situation (e.g., mandatory isolation level), should not be allowed to
use incremental maintenance of that type.

Although I have only limited practical matview experience (with SQL
Server (2005?) only, there it is called “indexed views”): I assume
that in most other DBMSs, the queries that can be used for defining
incrementally maintained matviews are very limited (they are in SQL
Server).

 Each of the following would fix this problem:

 (2) The matview’s defining query and the corresponding base tables
 must adhere to certain (very commonly satisfied) conditions (for
 example, if there is an FK between A.a and B.b, the example cannot be
 made to fail; The FK must not be deferred in the case of
 right-after-each-statement matview maintenance).

 Agreed.  That would cover a high percentage of cases regardless of
 transaction isolation level.  Knowing when you needed such a
 constraint to make incremental maintenance safe would be hard,
 though.

Reasoning about it becomes easier when you have a clean definition of
which queries you accept, and which ones you don’t (see below). Then
you can look at your definition and say “I allow X because it is
always correct; I don’t allow Y because I didn’t prove it to be always
correct (yet).”

 (3) The count algorithm must be implemented in a way that understands
 MVCC internals: Reading the base tables must be done using a technique
 that reads all rows (i.e., also the ones not visible to the current
 transaction), and the xmin/xmax/etc of the updated rows in the matview
 must be set in a smart way, so as to yield visibility results that
 correspond to the visibility of the “originating” rows in the base
 tables. Calculation of the matview deltas (especially the part where
 the base tables are inspected for joining rows) must in this case
 probably be done in a serialized manner.

 I will look at this option some more, but on the face of it, I'm
 not quite following how it would work; and it sounds invasive and
 fragile. If you know of any paper on this approach you could point
 me to, or if you could sketch it out in a little more detail, I
 would appreciate that.

Sorry, these are my own musings: I have never found any paper about it.

After a bit more thought, I think it wouldn’t be so easy for the count
algorithm: It would be necessary to express (using physical MVCC rows)
something like “you can only see this (logical) row if both
transactions T1 and T2 are visible to you.” Disjunction would be easy:
create two physical rows with the same data, one that is visible if T1
is visible, and one that is visible if T2 is visible. Unfortunately,
we need conjunction here :-(.

(Upgrading the expressiveness of the MVCC-related information in the
rows would help, but as that is more invasive than anyone wants to
think about, I restrain those thoughts to the confines my own head
;-).)

OTOH, I think that this technique could be used in the case of simple
“aggregation” matviews (without joins, or *maybe* when all joins are
constrained by FKs). (I assume an implementation such as the one that
I sketched in a previous post[1] a few months back.)

[1] 
URL:http://www.postgresql.org/message-id/cap-rdty+vz8t_8jbvsksdrdz-_wr5ef80m-_ukgu5bhw+rk...@mail.gmail.com

 There will, at least for the next several releases, be matviews
 defined with queries for which we cannot support incremental
 maintenance -- REFRESH will continue to be the only way to deal
 with these until we enhance incremental maintenance to support them.
 For example, in 9.4 I don't intend to support incremental
 maintenance of matviews defined with recursive queries.  To me it
 makes sense to prohibit incremental maintenance of a matview for
 which there is no strategy to deal with concurrency problems.

+1

I think you need a strict definition, specific to your incremental
maintenance algorithm, of which queries you accept and which ones you
don’t. Such a definition could be based on SQL syntax, or on any of
the intermediate forms that are produced by the
parser/rewriter/analyzer/planner/etc. I think that basing your
definition on a later processing stage (e.g., after subqueries are
flattened, after views are expanded, or after “redundant” predicates
are removed) would increase the amount of queries that are allowed,
but might make it somewhat more difficult for a user to predict
whether a query allows incremental maintenance or not. (It might also
make your definition

Re: [HACKERS] counting algorithm for incremental matview maintenance

2013-05-17 Thread Nicolas Barbier
2013/5/17 Claudio Freire klaussfre...@gmail.com:

 On Fri, May 17, 2013 at 4:25 PM, Kevin Grittner kgri...@ymail.com wrote:

 (3) The count algorithm must be implemented in a way that understands
 MVCC internals: Reading the base tables must be done using a technique
 that reads all rows (i.e., also the ones not visible to the current
 transaction), and the xmin/xmax/etc of the updated rows in the matview
 must be set in a smart way, so as to yield visibility results that
 correspond to the visibility of the “originating” rows in the base
 tables. Calculation of the matview deltas (especially the part where
 the base tables are inspected for joining rows) must in this case
 probably be done in a serialized manner.

 I will look at this option some more, but on the face of it, I'm
 not quite following how it would work; and it sounds invasive and
 fragile.

 I don't believe it would be that problematic if deltas:

 1 - contains all changes, matching join conditions or not, that could
 potentially alter the matview. This includes the example's alterations
 since the columns being altered were part of the matview's definition,
 but it would not cover updates to columns that were not part of the
 definition (ie: not involved in the join or the selected rows).
 2 - each update is marked with its xid

 Then, serialization can be achieved by only applying committed xids,
 discarding rolled back xids, and evaluating join satisfaction only
 during the updates, and not during delta logging.

I think that your idea of postponing some of the processing to later
commits might indeed make it possible to implement this without
needing the “MVCC-expressiveness enhancement” that I mentioned in my
previous post (by postponing the application to the matview until at
least *someone* must be able to see that particular change). Right now
I have some difficulties working out the MVCC-details of what to do
when joining a base table delta row with the contents of the other
base tables (to create the matview delta), but I’m probably just tired
:-).

In any case, your proposal seems to require the following additional
“infrastructural changes” relative to Kevin’s proposal:

(1) The base table deltas must be shared by all “in flight”
transactions that are in the process of putting stuff in them and
using them to apply changes to the matviews.
(2) Putting stuff in base table deltas must be transactionally safe
(unless all matviews that care about that specific base table delta
are unlogged).
(3) “Right-after-each-statement matview maintenance” is not possible
(without additional effort).

Nicolas



Re: [HACKERS] Drastic performance loss in assert-enabled build in HEAD

2013-04-05 Thread Nicolas Barbier
2013/4/5 Noah Misch n...@leadboat.com:

 On Thu, Apr 04, 2013 at 12:28:01PM +0200, Nicolas Barbier wrote:

 +1. Having unlogged matviews without having incremental updates yet,
 isn't super useful anyway.

 I would have surmised the opposite: since an unlogged MV requires a full
 refresh at unpredictable moments, logged MVs will be preferred where a refresh
 is prohibitively expensive.

That sounds like a good reason to choose your matview to be logged indeed.

I.e., very expensive to rebuild → choose logged

The opposite is also true: If your matview is not so expensive to
rebuild, why would it matter that much if it is logged? (Rebuilding
would be a tad slower, but it is not that slow to start with, so who
cares?)

I.e., not so expensive to rebuild → logged or unlogged are fine

This would mean “always choose logged,” except for the restricted case
of “incremental updates of a matview that is not so expensive to
rebuild” that I describe next:

 Why might unlogged-MV applications desire incremental updates more acutely
 than logged-MV applications?

My reasoning was more like: If you have incremental updates, there
will probably be some overhead linked to executing any transaction
that updates the base tables, namely for storing the changesets
somewhere. I imagined it could at least be this storing of changesets
that one would want to be unlogged, lest it slow down the commit of
most transactions that don’t even touch the matview.

(Note that losing any changeset is the same as losing the contents of
the whole matview in the “always guaranteed to be 100% up-to-date when
read” case.)

Of course, because of the previous reasoning, we should always make a
matview logged if it is very expensive to rebuild. That leaves the
case of a not-so-expensive matview: It is smart to make it unlogged
when the changesets are stored in a way similar to what I just
described.

Anyway, I conclude that (especially as we don’t have incremental
updates yet), unlogged matviews are, as things stand now, not very
useful.


As a sidenote, I see two ways to avoid storing changesets as part of a
commit that changes the base tables. Choosing any of those would
invalidate my previous logic, and even more diminish the need for
unlogged matviews:

(1) WAL is used as the source for the changesets; Andres’ logical
replication work comes to mind. Probably mostly useful for the “always
guaranteed to be 100% up-to-date when read” case.

(2) The base tables are scanned and the xmin/xmax values are used to
determine what changed since last time. This probably requires keeping
VACUUM from removing rows that are still needed to determine how to
change any matviews that depend on that base table. Probably mostly
useful for the case where bulk-incremental updates are performed only
sporadically, and where the base tables are not so big that they
cannot usefully be scanned in full.

Nicolas



Re: [HACKERS] Drastic performance loss in assert-enabled build in HEAD

2013-04-04 Thread Nicolas Barbier
2013/4/3 Tom Lane t...@sss.pgh.pa.us:

 Kevin Grittner kgri...@ymail.com writes:

 To be honest, I don't think I've personally seen a single use case
 for matviews where they could be used if you couldn't count on an
 error if attempting to use them without the contents reflecting a
 materialization of the associated query at *some* point in time.

 Well, if we remove the WITH NO DATA clause from CREATE MATERIALIZED
 VIEW, that minimum requirement is satisfied no?

An argument against that is that computing the contents may be very expensive.

 Granting that throwing an error is actually of some use to some people,
 I would not think that people would want to turn it on via a command
 that throws away the existing view contents altogether, nor turn it off
 with a full-throated REFRESH.  There are going to need to be ways to
 incrementally update matviews, and ways to disable/enable access that
 are not tied to a complete rebuild, not to mention being based on
 user-determined rather than hard-wired criteria for what's too stale.
 So I don't think this is a useful base to build on.

Am I correct when I think that you are saying here, that the “zero
pages == unscannable” logic is not very future-proof? In that case I
concur, and I also think that this knowledge leaks in way too many
other places (the VACUUM bug mentioned by Kevin is a good example).

 If you feel that scannability disable is an absolute must for version 0,
 let's invent a matview reloption or some such to implement it and let
 users turn it on and off as they wish.  That seems a lot more likely
 to still be useful two years from now.

(In the context of making an unlogged matview unscannable after a crash:)

Is it imaginable that such a reloption could (in a future
implementation) be changed during or right after crash recovery? For
example, by storing the set of “truncated by crash recovery” relations
in a shared catalog table, which is then inspected when connecting to
a database to continue the truncation (in the case of a matview by
making it unscannable)?

 And if you're absolutely convinced that unlogged matviews mustn't work as I
 suggest, we can lose those from 9.3, too.

+1. Having unlogged matviews without having incremental updates yet,
isn’t super useful anyway.

 What I'd actually rather see us spending time on right now is making
 some provision for incremental updates, which I will boldly propose
 could be supported by user-written triggers on the underlying tables
 if we only diked out the prohibitions against INSERT/UPDATE/DELETE on
 matviews, and allowed them to operate on a matview's contents just like
 it was a table.  Now admittedly that would foreclose allowing matviews
 to be updatable in the updatable-view sense, but that's a feature I
 would readily give up if it meant users could build incremental update
 mechanisms this year and not two years down the road.

Please make the syntax for updating the “extent” (physical
representation) of a matview different from updating the view’s
logical contents. Examples:

(1) Require to use a special function to update the extent:

SELECT pg_mv_maintain('INSERT INTO example_matview ...');

While parsing the INSERT, the parser would know that it must interpret
“example_matview” as the matview’s extent. As the extent and the view
are currently the same, nothing needs to be done except allowing the
INSERT only when it is parsed in the context of pg_mv_maintain, and
otherwise saying that matviews aren’t updatable yet (“NOTICE: did you
mean to update the extent? in that case use pg_mv_maintain”).

(2) Use a different schema (cf. TOAST) for the extent, e.g., view
“public.example_matview” vs. extent “pg_mv_extent.example_matview”. I
imagine future implementations to possibly require multiple extents
anyway, e.g., for storing the “not yet applied changesets” or other
intermediate things.
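
To make (2) a bit more concrete, a tiny sketch of how it might look
from the maintenance code’s point of view (hypothetical column names;
none of this catalog layout exists today):

-- Applications keep querying the logical view:
SELECT * FROM public.example_matview;

-- The refresh/maintenance machinery would instead write to the extent:
INSERT INTO pg_mv_extent.example_matview (kind, weight)
VALUES ('banana', 125);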

 Why exactly do you think that what I'm suggesting would cause a dump and
 reload to not regenerate the data?

Expensiveness: Matviews are used in cases where the data is expensive
to compute.

 We *need* to get rid of that aspect of things.  If you must have
 scannability state in version 0, okay, but it has to be a catalog property
 not this.

+1

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REFRESH MATERIALIZED VIEW locklevel

2013-03-08 Thread Nicolas Barbier
2013/3/8 Andres Freund and...@anarazel.de:

 On 2013-03-07 15:21:35 -0800, Josh Berkus wrote:

 This limitation is in no way crippling for this feature, or even a major
 detraction.  I still intend to promote the heck out of this feature.

 Thats scaring me. Because the current state of the feature isn't
 something that people expect under the term materialized views and I
 am pretty damn sure people will then remember postgres as trying to
 provide a tick-box item without it being really usable in the real
 world.
 And thats not something I want postgres to be known for.

+1. It seems wise to wait for the feature to ripen some more. That
way, the impact of any promotion will be stronger; most people
understand “materialized views” to mean something more than what is
currently there.

Of course, a drawback of waiting would be that you might lose the
momentum of the expression “materialized views.” OTOH, any questions
along the lines of “I thought PG supported materialized views since
9.3? Why are they making such a fuss about it now (i.e., > 9.3)?”
would lead to people discussing even more, which might enhance the
effect of the promotion.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-07 Thread Nicolas Barbier
2013/3/5 Kevin Grittner kgri...@ymail.com:

 Perhaps it would be worth looking for anything in the patch that
 you think might be painting us into a corner where it would be hard
 to do all the other cool things.  While it's late enough in the
 process that changing anything like that which you find would be
 painful, it might be a lot more painful later if we release without
 doing something about it.  My hope, of course, is that you won't
 find any such thing.  With this patch I've tried to provide a
 minimal framework onto which these other things can be bolted.
 I've tried hard not to do anything which would make it hard to
 extend, but new eyes may see something I missed.

(Without having looked at the patch, or even the documentation :-/.)

I think that something that might prove useful is the following:
Keeping in mind the possibility of storing something in the matview’s
heap that doesn’t correspond to what a SELECT * FROM matview would
yield (i.e., the “logical content”). The transformation could be
performed by an INSTEAD rule (similar to how a view is expanded to its
definition, a reference to a matview would expand to its heap content
transformed to the “logical content”).

(Note that I don’t have any reason to believe that the current
implementation would make this more difficult than it should be.)

ridiculously long rationale for the previous

(All the following requires making matviews (inter- and intra-)
transactionally up-to-date w.r.t. their base tables at the moment of
querying them. I don’t deal with approximate results, however useful
that might be.)

I think that the possibility of optimizing COUNT(*) (see mail by Greg
Stark in this thread with “the queue of updates with transacion
information that are periodically flattened into the aggregate”) can
be generalized to generic aggregation that way. The idea would be that
a transaction that adds (or deletes or updates) a row in a base table
causes a “delta” row version in the matview. Selecting from the
matview then merges these deltas into one value (for each row that is
logically present in the matview). Every once in a while (or rather
quite often, if the base tables change often), a VACUUM-like clean-up
operation must be run to merge all rows that are “old enough” (i.e.,
whose transactions are not in flight anymore).

Example of trivial aggregation matview weight_per_kind defined as:

SELECT kind, SUM(weight) FROM fruit GROUP BY kind;

The matview would then physically contain rows such as:

xmin, xmax, kind, weight
1000, 0, 'banana', 123
1000, 0, 'apple', 1
1001, 0, 'banana', 2
1002, 0, 'banana', -3

Which means:

* tx 1000 probably performed a clean-up operation and merged a bunch
of banana rows together to yield 123; it also inserted an apple of
weight 1.
* tx 1001 inserted a banana of weight 2. Any clean-up operation coming
by could not merge the 2 into the first row, as long as tx 1000 is in
flight. Otherwise, it would yield 125 (physically this would mean
adding a 125 row, marking the 123 and 2 rows as deleted, and then
waiting for VACUUM to remove them).
* tx 1002 deleted a banana with weight 3.

The result of a SELECT * FROM weight_per_kind; would actually execute
SELECT kind, SUM_merge(weight) FROM heap_of_weight_per_kind GROUP BY
kind;

This would result, for tx 1001 (assuming tx 1000 committed and our
snapshot can see it), in:

kind, weight
'banana', 125
'apple', 1

(The -3 is not taken into account, because it is not visible to tx 1001.)

The operator to use at the location of SUM_merge is something that
merges multiple aggregation results (plus results that represent some
kind of “negation”) together. For SUM, it would be SUM itself and the
negation would be numerical negation. There might also be some kind of
“difference” concept used for UPDATE: When updating a weight from 4 to
3, the difference would be -1. Those additional properties could
optionally be added to the definition of each aggregation function; It
must be done for each function that you want to use in such a way.
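
A minimal sketch of the SUM case, pretending the matview’s heap is an
ordinary table (hypothetical names; the xmin/xmax bookkeeping and
visibility filtering are assumed to be handled by the system as
described above):

-- Physical heap: one “delta” row per contributing transaction.
CREATE TABLE heap_of_weight_per_kind (kind text, weight numeric);

-- Logical contents, i.e. what SELECT * FROM weight_per_kind returns:
-- for SUM, the merge operator is simply SUM again, and negative rows
-- (like the -3 banana above) encode deletions.
SELECT kind, SUM(weight) AS weight
FROM heap_of_weight_per_kind
GROUP BY kind;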

Other aggregation functions such as AVG would require storing the SUM
+ number of rows in the matview (otherwise two AVGs could not be
merged); again a difference between the heap and the logical content.
Functions such as MIN and MAX are more difficult to fit in this
framework: I can only see how it would work if row deletion were not
allowed (which might still be a valid use-case); luckily, I think MIN
and MAX are not the typical things for which you would want to use
matviews, because quick computation can typically be done directly
using the base tables.

This whole thing would result in incremental updates of
aggregation-matviews that don’t require physical serialization of the
transactions that update the base tables and that query the matview,
which other models (that depend on correspondence between the heap and
logical content of matviews) would probably require.

And that’s where I stop rambling because nobody gets this far anyway,

Re: [HACKERS] Materialized views WIP patch

2013-03-05 Thread Nicolas Barbier
2013/3/5 Robert Haas robertmh...@gmail.com:

 All that having been said, it's hard for me to imagine that anyone
 really cares about any of this until we have an incremental update
 feature, which right now we don't.  Actually, I'm betting that's going
 to be significantly harder than automatic-query-rewrite, when all is
 said and done.

I agree.

E.g., things such as keeping a matview consistent relative to changes
applied to the base tables during the same transaction, might be
mightily difficult to implement in a performant way. OTOH, matviews
that can only be used for optimization if their base tables were not
changed “too recently” (e.g., by transactions that are still in
flight, including the current transaction), are probably kind of
useful in themselves as long as those base tables are not updated all
the time.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-05 Thread Nicolas Barbier
2013/3/5 Kevin Grittner kgri...@ymail.com:

 Exactly.  I predict that we will eventually have some special sort
 of trigger for maintaining MVs based on base table changes to
 handle the ones that are just too expensive (in developer time or
 run time) to fully automate.  But there is a lot of low-hanging
 fruit for automation.

I think it would be totally OK to restrict the possible definitions
for matviews that can be maintained fully incrementally to something
like:

SELECT attributes and aggregations FROM trivial joins WHERE trivial
condition GROUP BY attributes;

Those definitions are the most useful for optimizing the things that
matviews are good at (joins and aggregation).
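
For example, a definition of that restricted shape could look as
follows (all table and column names made up):

SELECT c.region,
       p.category,
       SUM(oi.quantity * oi.price) AS turnover,
       COUNT(*) AS line_count
FROM order_items oi
JOIN orders o ON o.id = oi.order_id
JOIN customers c ON c.id = o.customer_id
JOIN products p ON p.id = oi.product_id
WHERE o.status = 'shipped'
GROUP BY c.region, p.category;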

Nicolas

PS. Sorry for having fired off this discussion that obviously doesn’t
really relate to the current patch.

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-03 Thread Nicolas Barbier
2013/3/3 Kevin Grittner kgri...@ymail.com:

 Rewriting queries using
 expressions which match the MV's query to pull from the MV instead
 of the underlying tables is the exception.  While that is a sexy
 feature, and I'm sure one can construct examples where it helps
 performance, it seems to me unlikely to be very generally useful.
 I suspect that it exists mostly so that people who want to write an
 RFP to pick a particular product can include that as a requirement.
  In other words, I think the main benefit of automatic rewrite
 using an MV is marketing, not technical or performance.

I think that automatically using materialized views even when the
query doesn’t mention them directly, is akin to automatically using
indexes without having to mention them in the query. That way, queries
can be written the natural way, and “creating materialized views” is
an optimization that can be applied by a DBA without having to update
the application queries to use them.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-03 Thread Nicolas Barbier
2013/3/3 Kevin Grittner kgri...@ymail.com:

 Nicolas Barbier nicolas.barb...@gmail.com wrote:

 I think that automatically using materialized views even when the
 query doesn’t mention them directly, is akin to automatically
 using indexes without having to mention them in the query. That
 way, queries can be written the natural way, and “creating
 materialized views” is an optimization that can be applied by a
 DBA without having to update the application queries to use them.

 Oh, I understand that concept perfectly well, I just wonder how
 often it is useful in practice.  The cost of planning with indexes
 tends to go up dramatically the planner needs to evaluate all
 possible combinations of access paths.  We've devoted quite a bit
 of energy keeping that from being something like the factorial of
 the number of indexes.  If you now need to find all materialized
 views which could substitute for parts of a query, and look at all
 permutations of how those could be used, and which indexes can be
 used for each of those combinations, you have planning time which
 can explode to extreme levels.

I guess that a basic version of such a feature would do something like this:

(1) Check for each matview whether it simply consists of an optional
bunch of joins + optional aggregation + optional general filter
conditions (to support something akin to a partial index). If not, the
optimization doesn’t take this matview into account. This step can be
done beforehand.
(2) Check for each (sub)query in the query-to-optimize whether it does
the following (at a smart point in the optimization phase). If any of
these conditions are not met, don’t use this matview:
  - Joins at least the tables that are joined in the matview.
  - Contains join conditions and general filter conditions that are at
least as strict.
  - Doesn’t refer elsewhere to any attributes that the matview doesn’t contain.
(3) Always replace the corresponding query parts with the matview
(i.e., assume that the cost will always be lower than performing the
original query).
(4) If multiple matviews fit in step 3, try them all (and use the one
that yields the lower total cost).
(5) Always replace any aggregation with the corresponding
aggregation-results (if they exist) from the matview.

That doesn’t sound as if it would make planning time explode that much
(except, because of step 4, when there are many matviews that contain
overlapping sets of joined tables, and a query joins over the union of
those sets, *and* the replaceable-by-matviews parts are in
subqueries). It could even decrease it significantly (e.g., in the
case where a bunch of joins would be replaced with a scan of a
matview).
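
As a made-up example of steps (2) and (3): suppose a matview
sales_by_region is defined as SELECT region, SUM(amount) AS total FROM
sales GROUP BY region. Then the first query below could be answered
from the matview as in the second one (the substitution would of course
happen inside the planner; the rewritten form is only spelled out for
illustration):

-- Query as written by the application:
SELECT region, SUM(amount)
FROM sales
GROUP BY region
HAVING SUM(amount) > 1000;

-- Equivalent query using the matview:
SELECT region, total
FROM sales_by_region
WHERE total > 1000;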

Also, I suppose that once such functionality exists, application
writers would be more inclined to write “heavy” queries that do lots
of aggregation even in an OLTP environment, of the kind that is these
days typically only done in OLAP environments.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-02-19 Thread Nicolas Barbier
2013/2/19 Robert Haas robertmh...@gmail.com:

 In the department of crazy ideas, what about having pg_dump NEVER
 refresh ANY materialized views?

 It's true that the job of pg_dump and pg_restore is to put the new
 database in the same state that the old database was in, but I think
 you could make a reasonable case that materialized views ought to be
 an exception.  After all, even with all of this infrastructure,
 chances are pretty good that the new MV contents won't end up being
 the same as the old MV contents on the old server - because the old
 MVs could easily have been stale.  So why not just get the restore
 over with as fast as possible, and then let the user refresh the MVs
 that they think need refreshing (perhaps after getting the portions of
 their system that don't rely on MVs up and running)?

 At the very least, I think we ought to have an option for this
 behavior.  But the more I think about it, the more I think maybe it
 ought to be the default.

+1 from me from a minimalist point of view.

I think of a matview of the manually refreshed kind as “can contain
stale contents (or be invalid) unless someone manually makes sure it
is up to date (or valid)”. Making any matviews invalid by default upon
restoring (itself being a manual action) would be consistent with that
definition. Additionally, ISTM to be the least arbitrary (and hence
most elegant) choice, and even more so in the context of
matviews-depending-on-matviews.

Spamming some more craziness:

Another (more elaborate) suggestion could be: Store for each matview
whether it is to be rebuilt upon restore or not. Using this setting
would intuitively mean something like “I consider this matview being
valid a precondition for considering the database state valid.”
Setting this to true for a matview would only be allowed when any
other matviews on which it depends also have this setting set to true.

Just my €0.02 of course.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Docs: Make notes on sequences and rollback more obvious

2012-08-18 Thread Nicolas Barbier
2012/8/7 Kevin Grittner kevin.gritt...@wicourts.gov:

 I also think it's a problem that one can get through the entire
 Concurrency Control chapter (mvcc.sgml) without a clue that
 sequences aren't transactional.

It is possible to say that they *are* transactional when considering
the following definition: nextval() doesn’t always give you “the” next
value, but “some” next value that is higher than the one gotten by any
preceding transactions.
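
A small illustration of that definition (sequence name made up; the
exact numbers depend on prior activity and caching):

CREATE SEQUENCE s;

BEGIN;
SELECT nextval('s');   -- returns 1
ROLLBACK;

SELECT nextval('s');   -- returns 2: the rolled-back call is not undone,
                       -- but the value is still higher than all earlier ones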

I personally like it better to introduce this minor complexity in the
definition of sequences, rather than messing with the definition of
transactionality.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing optimization when filters are applied after window functions

2012-05-17 Thread Nicolas Barbier
2012/5/17 Volker Grabsch v...@notjusthosting.com:

 Also, is there any chance to include a (simple) attempt of
 such an optimiztation into PostgreSQL-9.2 beta, or is this
 only a possible topic for 9.3 and later?

For 9.2, you’re about 4 months late :-). The last commitfest was in January:

URL:https://commitfest.postgresql.org/action/commitfest_view?id=13

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-04-25 Thread Nicolas Barbier
2012/4/25 Simon Riggs si...@2ndquadrant.com:

 On Wed, Apr 25, 2012 at 4:55 AM, Noah Misch n...@leadboat.com wrote:

 I do not see a clean behind-the-scenes fix for points 1, 4 and 5.  We can
 resolve those by adding a new variety of temporary table, one coincidentally
 matching the SQL standard's notion of a temporary table.  The developer will
 declare it once, after which all sessions observe it as an initially-empty
 table whose contents remain local to the session.

[..]

 I don't see how introducing a new type of temp table solves this
 problem. How would the developer declare this in HS? How would it then
 be globally visible without using global OIDs, causing sinval and
 using global locks?

The declarative creation of a “standard-like” temporary table only
happens once (it is part of the schema). Using (e.g. putting stuff in
and executing queries on) such tables can happen on the standby
without the master having to know.

Therefore, I don't see the problem. Just schedule issuing the creation
along with any other schema-changes on the master.
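
For reference, the standard-style declaration that would be issued
once, as part of the schema, looks roughly like this (PostgreSQL
currently treats GLOBAL as a noise word and applies its own temp-table
semantics, so this only shows what the declaration would look like, not
working standard behaviour):

CREATE GLOBAL TEMPORARY TABLE session_scratch (
    id   integer,
    note text
) ON COMMIT PRESERVE ROWS;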

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-04-25 Thread Nicolas Barbier
2012/4/25 Simon Riggs si...@2ndquadrant.com:

 So you are saying it is OK to not be able to *create* them on HS, just
 *use* pre-defined tables?

 That's almost useless IMHO.

 Applications expect to be able to do this all in the same transaction
 on one session
 CREATE TEMP TABLE x;
  ...DML commands...
 SELECT ... FROM x;

That’s not how standard-like temporary tables work; they are supposed
to be declared beforehand. That makes sense if you consider the schema
and the set of database-using applications as one. I assume that
wanting to define applications independently from the database schema
is the reason of existence for the PG-like temporary transactions.

The way standard-like temporary tables work is exactly why I assume
Noah proposes to implement them: because they work nicely with HS.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-04-25 Thread Nicolas Barbier
2012/4/25 Nicolas Barbier nicolas.barb...@gmail.com:

 is the reason of existence for the PG-like temporary transactions.

s/transactions/tables/

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM ANALYZE is faster than ANALYZE?

2012-02-22 Thread Nicolas Barbier
2012/2/22 Robert Haas robertmh...@gmail.com:

 On Tue, Feb 21, 2012 at 9:00 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:

 I had to reply to query about usage VACUUM ANALYZE or ANALYZE. I
 expected so ANALYZE should be faster then VACUUM ANALYZE.

 But is not true. Why?

 I'm pretty sure that VACUUM ANALYZE *will* be faster than ANALYZE in
 general, because VACUUM has to scan the whole table, and ANALYZE only
 a fixed-size subset of its pages.

It sounds like you just said the opposite of what you wanted to say.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-04 Thread Nicolas Barbier
2012/1/4 Simon Riggs si...@2ndquadrant.com:

 On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund and...@anarazel.de wrote:

 I wonder if CRC32c wouldn't be a good alternative given more and more cpus
 (its in SSE 4.2) support calculating it in silicon.

 We're trying to get something that fits in 16bits for this release.
 I'm guessing CRC32c doesn't?

What happens to the problem-detecting performance of a 16 bit part of
a CRC32c vs. a real 16 bit checksum? If it is still as good, it might
make sense to use the former, assuming that there is a way to easily
trigger the silicon support and enough CPUs support it.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-29 Thread Nicolas Barbier
2011/12/30 Ants Aasma ants.aa...@eesti.ee:

 On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:

 positives.  To get this right for a checksum in the page header,
 double-write would need to be used for all cases where
 full_page_writes now are used (i.e., the first write of a page after
 a checkpoint), and for all unlogged writes (e.g., hint-bit-only
 writes).  There would be no correctness problem for always using
 double-write, but it would be unnecessary overhead for other page
 writes, which I think we can avoid.

 Unless I'm missing something, double-writes are needed for all writes,
 not only the first page after a checkpoint. Consider this sequence of
 events:

 1. Checkpoint
 2. Double-write of page A (DW buffer write, sync, heap write)
 3. Sync of heap, releasing DW buffer for new writes.
  ... some time goes by
 4. Regular write of page A
 5. OS writes one part of page A
 6. Crash!

 Now recovery comes along, page A is broken in the heap with no
 double-write buffer backup nor anything to recover it by in the WAL.

I guess the assumption is that the write in (4) is either backed by
the WAL, or made safe by double writing. ISTM that such reasoning is
only correct if the change that is expressed by the WAL record can be
applied in the context of inconsistent (i.e., partially written)
pages, which I assume is not the case (excuse my ignorance regarding
such basic facts).

So I think you are right.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inlining comparators as a performance optimisation

2011-12-06 Thread Nicolas Barbier
2011/12/5 Tom Lane t...@sss.pgh.pa.us:

 What is bothering me is that this approach is going to cause substantial
 bloat of the executable code, and such bloat has got distributed costs,
 which we don't have any really good way to measure but for sure
 micro-benchmarks addressing only sort speed aren't going to reveal them.
 Cache lines filled with sort code take cycles to flush and replace with
 something else.

 I think it's possibly reasonable to have inlined copies of qsort for a
 small number of datatypes, but it seems much less reasonable to have
 multiple copies per datatype in order to obtain progressively tinier
 micro-benchmark speedups.  We need to figure out what the tradeoff
 against executable size really is, but so far it seems like you've been
 ignoring the fact that there is any such tradeoff at all.

[ Randomly spouting ideas here: ]

Might it not be a good idea to decide whether to use the inlined
copies vs. the normal version, based on how much data to sort? Surely
for a very large sort, the cache blow-out doesn't matter that much
relative to the amount of time that can be saved sorting?

Assuming that all types would have an inlined sort function, although
that would indeed result in a larger binary, most of that binary would
never touch the cache, because corresponding large sorts are never
performed. If they would sporadically occur (and assuming the points
at which inline sorting starts to get used are chosen wisely), the
possibly resulting cache blow-out would be a net win.

I am also assuming here that instruction cache lines are small enough
for cache line aliasing not to become a problem; putting all sort
functions next to each other in the binary (so that they don't alias
with the rest of the backend code that might be used more often) might
alleviate that.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch Review: Bugfix for XPATH() if text or attribute nodes are selected

2011-07-13 Thread Nicolas Barbier
2011/6/29, Florian Pflug f...@phlo.org:

 Secondly, there is little point in having an type XML if we
 don't actually ensure that values of that type can only contain
 well-formed XML.

+1. The fact that XPATH() must return a type that cannot depend on the
given expression (even if it is a constant string) may be unfortunate,
but returning XML-that-is-not-quite-XML sounds way worse to me.
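
For example (the selected node below is a bare text node whose
character data contains an unescaped ampersand, which is exactly the
kind of value that is not well-formed XML on its own):

SELECT xpath('/a/text()', '<a>b &amp; c</a>'::xml);
-- The text node's character data is "b & c"; returning it unescaped as
-- type xml would produce a value that is not well-formed XML content.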

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI modularity questions

2011-06-28 Thread Nicolas Barbier
2011/6/28, Robert Haas robertmh...@gmail.com:

 You know, it just occurred to me while reading this email that you're
 using the term predicate lock in a way that is totally different
 from what I learned in school.  What I was taught is that the word
 predicate in predicate lock is like the word tuple in tuple
 lock or the word relation in relation lock - that is, it
 describes *the thing being locked*.  In other words, you are
 essentially doing:

 LOCK TABLE foo WHERE i = 1;

 I think that what you're calling the predicate lock manager should
 really be called the siread lock manager, and all of the places where
 you are predicate locking a tuple should really be siread locking
 the tuple.

The predicate in the full table case is: any tuple in this table
(including tuples that don't exist yet, otherwise it wouldn't be a
predicate). The predicate in the index case is: any tuple that would
be returned by so-and-such index scan (idem regarding tuples that
don't exist yet, hence locking the gaps).

The lock semantics (i.e., how conflicts between it and other locks are
defined and treated) are siread. The thing that it applies to is a
predicate. (I.e., PostgreSQL before SSI already supported some rather
trivial kind of predicate lock: the full table lock.)

Conclusion: I don't see the problem :-).

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XPATH evaluation

2011-06-17 Thread Nicolas Barbier
2011/6/17, Andrew Dunstan and...@dunslane.net:

 On 06/17/2011 10:55 AM, Radosław Smogura wrote:

 XML canonization preservs whitespaces, if I remember
 well, I think there is example.

 In any case if I will store image in XML (I've seen this), preservation of
 white spaces and new lines is important.

 If you store images you should encode them anyway, in base64 or hex.

Whitespace that is not at certain obviously irrelevant places (such as
right after , between attributes, outside of the whole document,
etc), and that is not defined to be irrelevant by some schema (if the
parser is schema-aware), is relevant. You cannot just muck around with
it and consider that correct.

 More generally, data that needs that sort of preservation should
 possibly be in CDATA nodes.

CDATA sections are just syntactic sugar (a form of escaping):

URL:http://www.w3.org/TR/xml-infoset/#omitted

Appendix D: What is not in the Information Set
[..]
19. The boundaries of CDATA marked sections.

Therefore, there is not such thing as a CDATA node that would be
different from just text (Infoset-wise).

Note that that does not mean that binary data is never supposed to be
altered or that all binary data is to be accepted: e.g., whether
newlines are represented using \n, \r, or \r\n is irrelevant;
also, binary data that is not valid according to the used encoding
must of course not be accepted.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XPATH evaluation

2011-06-17 Thread Nicolas Barbier
2011/6/17, Andrew Dunstan and...@dunslane.net:

 On 06/17/2011 11:29 AM, Nicolas Barbier wrote:

 CDATA sections are just syntactic sugar (a form of escaping):

 Yeah. OTOH doesn't an empty CDATA section force a child element, where a
 pure empty element does not?

Wow, some Googling around shows that there is much confusion about
this. I thought that it was obvious that adding <![CDATA[]]> shouldn’t
change the content at all, but quite a few people seem to disagree
:-/.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cube Index Size

2011-05-31 Thread Nicolas Barbier
2011/5/30, Nick Raj nickrajj...@gmail.com:

 3. When tuples are 5 lakh

For the benefit of the others: 5 lakh seems to mean 500,000.

URL:http://en.wikipedia.org/wiki/Lakh

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Please test peer (socket ident) auth on *BSD

2011-05-31 Thread Nicolas Barbier
2011/5/31, Tom Lane t...@sss.pgh.pa.us:

 Unless maybe there's a kFreeBSD-like project out there with NetBSD as
 the kernel?)

There used to be an attempt by Debian (called GNU/NetBSD), but that
has since long been abandoned. I don't know of any other similar
projects.

URL:http://www.debian.org/ports/netbsd/

Wikipedia doesn't list any other similar projects either:

URL:http://en.wikipedia.org/wiki/GNU_variants#BSD_variants

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FailedAssertion

2011-05-12 Thread Nicolas Barbier
[ Forgot the list. ]

2011/5/12, Yves Weißig weis...@rbg.informatik.tu-darmstadt.de:

 I'm currently debugging my developed AM and are running into this problem:

 TRAP: FailedAssertion("!(((bool) (((void*)(tid) != ((void *)0)) &&
 ((tid)->ip_posid != 0))))", File: "indexam.c", Line: 488)

 Can anybody explain what it means? I'm having difficulties to understand
 what went wrong.

Just in case you are wondering about the details of the macro-expansion:

indexam.c:484:

/* Switch to correct buffer if we don't have it already */
prev_buf = scan->xs_cbuf;
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
                                     scan->heapRelation,
                                     ItemPointerGetBlockNumber(tid));

I.e., what you see is the expansion of ItemPointerGetBlockNumber,
which is (itemptr.h:69):

AssertMacro(ItemPointerIsValid(pointer)), \
BlockIdGetBlockNumber(&(pointer)->ip_blkid) \

ItemPointerIsValid is (itemptr.h:62):

((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))

I.e., ultimately the whole thing checks whether tid is non-null and
its ip_posid member is non-zero (whether tid is a valid item
pointer).

I hope that helps you to figure out what happens in the context of
what you are doing.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] the big picture for index-only scans

2011-05-11 Thread Nicolas Barbier
2011/5/11, Bruce Momjian br...@momjian.us:

 FYI, because the visibility map is only one _bit_ per page, it is 8000 *
 8 or 64k times smaller than the heap, e.g. one 8k page covers 64MB of
 heap pages.

Actually, that would be one 8kB block covers 512MB of heap: 1 block
of visibility map (8kB) = 64k visibility bits = covers 64k blocks =
covers 512MB of heap. The cost of keeping the visibility map in cache
is therefore totally negligible, only the cost of WAL logging changes
to it is of interest.

 This is important because we rely on this compactness in hope that
 the WAL logging of this information will not be burdensome.

The size of an entry in the map (1 bit) is not very related to the WAL
overhead required per change of such a bit (i.e., the log record for a
1 bit change will certainly be way more than 1 bit).

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Predicate locking

2011-04-27 Thread Nicolas Barbier
2011/4/27 Vlad Arkhipov arhi...@dc.baikal.ru:

 I'm currently need predicate locking in the project, so there are two ways
 to get it by now: implement it by creating special database records to lock
 with SELECT FOR UPDATE or wait while they will be implemented in Postgres
 core. Is there something like predicate locking on the TODO list currently?

I assume you want (real, as opposed to what is in  9.1 now)
SERIALIZABLE transactions, in which case you could check:

URL:http://wiki.postgresql.org/wiki/Serializable

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL sourcecode

2011-03-30 Thread Nicolas Barbier
2011/3/30 aaronenabs aaronen...@btconnect.com:

 Hi all i have been trying to compile the sourcecode for postgresql but keep
 getting an error message when running it in cygwin.

 it states:

 dllwrap: gcc exited with status 1
 make[3]: *** [cygpq.dll] Error 1
 make[3]: Leaving directory
 `/postgresql-9.0.3/postgresql-9.0.3/src/interfaces/li
 bpq'
 make[2]: *** [all] Error 2
 make[2]: Leaving directory
 `/postgresql-9.0.3/postgresql-9.0.3/src/interfaces'
 make[1]: *** [all] Error 2
 make[1]: Leaving directory `/postgresql-9.0.3/postgresql-9.0.3/src'
 make: *** [all] Error 2

FYI, you left out the error message stating the real problem that
probably came right before the part you pasted here. The Make[n]
things are just the submakes that return an error recursively.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL sourcecode

2011-03-30 Thread Nicolas Barbier
2011/3/30 aaronenabs aaronen...@btconnect.com:

 Can you alos advise how i change the the HeapTupleSatisfiesVisibility() to
 true within the source code:

[..]

 #define HeapTupleSatisfiesVisibility(tuple, snapshot, buffer) \
        ((*(snapshot)->satisfies) ((tuple)->t_data, snapshot, buffer))

As someone totally not familiar with the PostgreSQL source code, I
would guess something like:

#define HeapTupleSatisfiesVisibility(tuple, snapshot, buffer) (1)

You might want to check the return type of the satisfies function
pointer though.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-20 Thread Nicolas Barbier
2011/3/20 hom obsidian...@gmail.com:

 I trace into scan.c because I want to known how the paser tree is
 built and I debug the source step by step.

I suggest you learn how flex/bison work first. The contents of the *.c
files generated by flex/bison are not generally supposed to be
interpreted by humans, rather you should read their original sources
(*.l and *.y).

 Then the eclipse pick up the scan.I and the excute order does not
 match the code.

Eclipse seems to understand that any code corresponding to the
generated .c file actually originates in the .l file, but apparently
fails to match (some of?) the line numbers. OTOH, I cannot really
imagine how it is supposed to match them as long as you are not
executing lines that are literally copied from the .l file (e.g., as
long as the lexer or parser code itself is being executed), so that
may be normal.

Again: Do not try to read the generated .c files, but rather read the
corresponding .l and .y files. The tarballs may include those
generated .c files, but as you will find out when checking out the
repository itself, they are not really considered source (i.e., they
are not included). When debugging, skip over the lexer and parser code
itself, just put your breakpoints in the C code in the .l and .y files
(I hope Eclipse might match *those* line numbers a least, and make the
breakpoints work).

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: Multi-language glossary of PostgreSQL terms.

2011-03-15 Thread Nicolas Barbier
2011/3/15 Dmitriy Igrishin dmit...@gmail.com:

 Oleg Bartunov and I intend to begin the project of translation
 of PostgreSQL documentation on Russian.

Just to make sure that you are aware of a recently-started effort to
create a German translation, by Susanne and others:

URL:http://archives.postgresql.org/pgsql-hackers/2011-03/msg00346.php

I guess it may be a good idea to share some
process/tools/infrastructure to avoid duplicating work.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Read uncommitted ever possible?

2011-03-10 Thread Nicolas Barbier
2011/3/10 Jesper Krogh jes...@krogh.cc:

 On 2011-03-10 19:25, Bruce Momjian wrote:

 Sure, anyone can add text to that wiki;  you create a community account
 here:

        http://www.postgresql.org/community/signup

 Suggestion: Add this url to the login box on the wiki.

+1, Adrian von Bidder had the same problem just two days ago.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What happens If a table changes during a query/procedure execution

2011-03-09 Thread Nicolas Barbier
2011/3/9 Vlad Arkhipov arhi...@dc.baikal.ru:

 Let there are two transactions that were created with read commited
 isolation level. In the first one we're executing a SELECT query:
 SELECT * FROM t UNION ALL SELECT * FROM t;

 In the second transaction we're modifying the same table:
 INSERT INTO t DEFAULT VALUES;
 COMMIT;

 Is it possible that the last UNION part in the first query will retrieve not
 the same rows as the first one?

No, because a statement never sees changes committed by other
transactions while it is executing: its snapshot is taken when the
statement starts.

 Another scenario is where we're executing two SELECT queries in a stored
 procedure:
 BEGIN
  ...
  SELECT * FROM t;
  SELECT * FROM t;
 END;

 Is it possible to get different results in the second query?

Yes, because they are separate statements, and in READ COMMITTED mode,
a new snapshot is taken when a statement starts. See:
URL:http://www.postgresql.org/docs/9.0/static/transaction-iso.html#XACT-READ-COMMITTED.
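
A small illustration of the underlying behaviour, using two interactive
sessions instead of a stored procedure (the point being only that each
statement takes its own snapshot under READ COMMITTED):

-- Session 1, READ COMMITTED:
BEGIN;
SELECT count(*) FROM t;   -- say this returns 10

-- Session 2, meanwhile:
INSERT INTO t DEFAULT VALUES;
COMMIT;

-- Session 1, still inside the same transaction:
SELECT count(*) FROM t;   -- now returns 11
COMMIT;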

 Does SQL standard define the behaviour in such cases?

The first one certainly. The standard doesn't describe PL/PgSQL, so
the question is moot in the second case; nonetheless, I assume that
the answer would be yes in the case of SQL/PSM.

Note that the standard defines things that must never happen in the
case of READ COMMITTED; it does not specify that one *must* be able to
see the stuff as committed by previous transactions, for example.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What happens If a table changes during a query/procedure execution

2011-03-09 Thread Nicolas Barbier
2011/3/9 Nicolas Barbier nicolas.barb...@gmail.com:

 Note that the standard defines things that must never happen in the
 case of READ COMMITTED, it does not specify that one *must* be able to
 see the stuff as committed by previous transactions, for example.

Hmm, make that stuff as committed by concurrent transactions that
committed prior to our statement's execution.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Native XML

2011-03-02 Thread Nicolas Barbier
2011/3/1 Andrew Dunstan and...@dunslane.net:

 I think hierarchical data really only scratches the surface of the problem.
 It would be nice to be able to specify all sorts of context for searches:

   * foo after bar
   * foo near bar
   * foo and bar in the same paragraph
   * foo as a parent/child/ancestor/descendent/sibling/cousin of bar

I wonder whether you are deliberately describing XPath here? :-)

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sorting. When?

2011-02-11 Thread Nicolas Barbier
[ Please don't top-post. URL:http://en.wikipedia.org/wiki/Posting_style ]

2011/2/11 mac_man2...@yahoo.it mac_man2...@yahoo.it:

 So, is there any precise way to discover when sorting is invoked?

EXPLAIN shows how a query would be executed; explicit sorts should be
mostly obvious.
URL:http://www.postgresql.org/docs/9.0/static/sql-explain.html

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sorting. When?

2011-02-11 Thread Nicolas Barbier
2011/2/11 Kevin Grittner kevin.gritt...@wicourts.gov:

 mac_man2...@yahoo.it mac_man2...@yahoo.it wrote:

 I need to know, from an algorithmic point of view, in which cases
 sorting is invoked.

[..]

 Are your really looking to categorize the types of queries where
 sorting is *invoked*, or the ones where it is *considered*?  Or
 perhaps only those where it is *required*, since there are no
 possible plans without sorting?

Or, if you are seeking the exact rules that are used by the planner to
determine all possible plans from which the one with minimum cost is
chosen (and hence all ways in which sorts can be used), I think that
the source code is the only complete reference. A non-complete
introduction is:

URL:http://www.postgresql.org/docs/9.0/static/planner-optimizer.html

Basically, physical operators (seq scan, index scan, hash join, merge
join, nested loop, filter, set operation, etc) may require their input
to satisfy certain sort constraints (for example, both inputs of a
merge join need to be sorted on the join attribute(s)). If it happens
to be of lowest cost to explicitly sort the inputs right before
consuming them, that will be done. If there is a way to get the same
input in an already-ordered way (for example an index scan, or the
output of a merge join), so that the cost is less than the non-ordered
way + an explicit sort, then that already-ordered way will be chosen.

Super-basic example:

SELECT * FROM t ORDER BY a;

This may either perform a seq scan of table t and then do an explicit
sort, or perform a full index scan of an index on attribute a
(provided that such an index exists), in which case the explicit sort
is not needed because the index scan will deliver the rows in
already-sorted order. Which option is chosen depends on the cost: The
costs of both plans are calculated and the least costly plan is
chosen. See the (non-exhaustive) list of things that influence the
costs provided by Kevin to get a feeling for how many variables there
are that influence this choice.
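
For instance, with a hypothetical table and index, EXPLAIN shows which
of the two options was chosen (the actual choice depends on the cost
estimates mentioned above):

CREATE TABLE t (a integer, b text);
CREATE INDEX t_a_idx ON t (a);

EXPLAIN SELECT * FROM t ORDER BY a;
-- yields either something like
--   Sort
--     Sort Key: a
--     ->  Seq Scan on t
-- or, if the full index scan is estimated to be cheaper,
--   Index Scan using t_a_idx on t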

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sorting. When?

2011-02-10 Thread Nicolas Barbier
2011/2/10 mac_man2...@yahoo.it mac_man2...@yahoo.it:

 Which operations invoke the sorting algorithms implemented in the sorting
 module (tuplesort.c) ?
 Of course, one of those operations invoking sorting is the ORDER BY clause
 and the DISTINCT too.

 Moreover, the Merge Join should be implemented invoking sorting.

 Is there any other operation invoking sorting?

AFAIK, all set operators except for UNION ALL. (I am probably missing
a whole boatload of other things.)

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] query execution question

2011-02-09 Thread Nicolas Barbier
2011/2/9 amit sehas cu...@yahoo.com:

 Lets say that the cost based optimizer determines that the order of the
 joins should be   T1.a=T2.b followed by T2.c = T3.d followed by T3.e = T4.f

 the question we have is during query execution are the joins evaluated
 completely one by one in that order, or the first join is evaluated
 completely and generates an intermediate table which is then utilized
to perform the next join... this means that for such a query we will need
 space for all the intermediate tables, which if they are very large tables
 as they are in our case can significantly alter the cost of the operations...

[ This is a question more appropriate for pgsql-performance. ]

The optimizer doesn't only determine the order (or tree actually) in
which to perform the joins, but also how to perform them: nested loop,
merge, or hash join. Depending on those physical join types, something
might need to be materialized (merge: the intermediate sort tapes;
hash: the hash table built over the inner operand) or not (nested loop).

Please see the EXPLAIN statement if you want to know how the query
would be executed.
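
E.g., with the join conditions from the question (the chosen join
methods, their order, and any materialization show up directly in the
plan):

EXPLAIN
SELECT *
FROM t1
JOIN t2 ON t1.a = t2.b
JOIN t3 ON t2.c = t3.d
JOIN t4 ON t3.e = t4.f;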

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bad links in messages from commits

2011-02-01 Thread Nicolas Barbier
2011/2/1 Magnus Hagander mag...@hagander.net:

 On Tue, Feb 1, 2011 at 07:56, Pavel Stehule pavel.steh...@gmail.com wrote:

 2011/2/1 Magnus Hagander mag...@hagander.net:

 On Tue, Feb 1, 2011 at 05:53, Pavel Stehule pavel.steh...@gmail.com wrote:

 Hello

 There are broken links inside messages from commiters.

 projects /


 404 - No such project

 Are you using gmail? They have made some changes recently that breaks
 the viewing of the URLs. Haven't heard any non-gmail user complain,
 and not entirely sure how to fix it. a workaround is to use copy link
 location or whatever it's called in your browser and then paste that
 - that works without errors.

 yes, you has true. It's gmail bug.

 I'm actually not entirely sure it's a gmail bug - it looks more like
 it's at least partially gitweb's fault, but I'm not entirely sure...

Hgweb (the Mercurial equivalent of gitweb) has the same problem. I
have the impression that the semicolons in the URLs are initially not
escaped correctly or something. Both hgweb and gitweb use semicolons
in the URL; hgweb shows “no such method: changeset;node=69b90bdd52d1”
when following such a “...?cmd=changeset;node=69b90bdd52d1” link from
GMail (using Chrome, in case that matters).

Workaround: When I press enter on the URL in the address bar while the
404 is showing, the page reloads correctly without 404-ing.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In pg_test_fsync, use K(1024) rather than k(1000) for write size units.

2011-01-27 Thread Nicolas Barbier
2011/1/27 Bruce Momjian br...@momjian.us:

 Bruce Momjian wrote:

 Peter Eisentraut wrote:

 We use small k in postgresql.conf, so pg_test_fsync should use the
 same.  Using kB would be more accurate in any case.

 OK, done with the attached applied patch.

 FYI, I had used 'k' because this page suggests that k is 1000 and K is
 1024, at least by the JEDEC memory standards:

        http://en.wikipedia.org/wiki/Kilo

I can't find any reference to that on this page? The following does indeed say:

URL:http://en.wikipedia.org/wiki/JEDEC_memory_standards

<quote>
kilo (K): A multiplier equal to 1,024 [..] The specification notes
that these prefixes are included in the document only to reflect
common usage. It refers to the IEEE/ASTM SI 10-1997 standard as
stating, that this practice frequently leads to confusion and is
deprecated.
</quote>

If you want to make the difference explicit, consider using KiB
(1024, note the extra i) vs. kB (1000); although doing so is
probably not consistent with any other uses in PostgreSQL.

URL:http://en.wikipedia.org/wiki/Kibibyte

<quote>
The unit symbol for the kibibyte is KiB. The unit was established by
the International Electrotechnical Commission (IEC) in 1999 and has
been accepted for use by all major standards organizations.
</quote>

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI and Hot Standby

2011-01-21 Thread Nicolas Barbier
2011/1/21 Anssi Kääriäinen anssi.kaariai...@thl.fi:

 Sorry for bothering all of you, but I just don't get this. What if T2 rolls
 back instead of committing? Then the snapshot of T3 would have been valid,
 right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
 doesn't, because it can't see the changes of T2 in any case. Thus, it would
 seem that the snapshot is valid. On the other hand I can't see anything
 wrong in the logic in your post. What am I missing? I am feeling stupid...

 At least for dumps I don't see how T2 can matter (assuming T3 is the
 pg_dump's snapshot). Because if you reload from the dump, T2 never happened
 in that dump. In the reloaded database it just did not exist at all.

This has been discussed before; in [1] I summarized:

IOW, one could say that the backup is consistent only if it were
never compared against the system as it continued running after the
dump took place.

Nicolas

[1] URL:http://archives.postgresql.org/pgsql-hackers/2010-09/msg01763.php

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI and Hot Standby

2011-01-21 Thread Nicolas Barbier
2011/1/21 Robert Haas robertmh...@gmail.com:

 On Fri, Jan 21, 2011 at 8:05 AM, Nicolas Barbier
 nicolas.barb...@gmail.com wrote:

 2011/1/21 Anssi Kääriäinen anssi.kaariai...@thl.fi:

 Sorry for bothering all of you, but I just don't get this. What if T2 rolls
 back instead of committing? Then the snapshot of T3 would have been valid,
 right? Now, for the snapshot of T3 it doesn't matter if T2 commits or if it
 doesn't, because it can't see the changes of T2 in any case. Thus, it would
 seem that the snapshot is valid. On the other hand I can't see anything
 wrong in the logic in your post. What am I missing? I am feeling stupid...

 At least for dumps I don't see how T2 can matter (assuming T3 is the
 pg_dump's snapshot). Because if you reload from the dump, T2 never happened
 in that dump. In the reloaded database it just did not exist at all.

 This has been discussed before; in [1] I summarized:

 IOW, one could say that the backup is consistent only if it were
 never compared against the system as it continued running after the
 dump took place.

 But that's a pretty fair way to look at it, isn't it?

Indeed, I just wanted to point Anssi to the previous discussion.

 It occurs to me that focusing on how this is going to work on Hot
 Standby might be looking at the question too narrowly.  The general
 issue is - does this technique generalize to a distributed computing
 environment, with distributed transactions across multiple PostgreSQL
 databases?  For example, what if the control record in Kevin's example
 is stored in another database, or on another server.  Or what if some
 tables are replicated via Slony?  I realize this is all outside the
 scope of the patch, but that's exactly the point: making this stuff
 work across multiple databases (even if they are replicas of each
 other) is much more complex than getting it to work on just one
 machine.  Even if we could agree on how to do it, coming up with some
 hack that can only ever possibly work in the Hot Standby case might
 not be the best thing to do.

You seem to be questioning whether the normal (?) way of using 2PC on
multiple DBs that all use SERIALIZABLE isolation mode (just start
transactions, and let 2PC coordinate the commits) always results in
globally serializable behavior.

I must say that I don't immediately see the answer (my gut feeling
says nope), but it sure is an interesting question.

In the special case where all databases use SS2PL as their concurrency
mechanism, the answer is yes. (Because the possibly conflicting "local"
transactions of any global transactions that touch the same tables
(i.e., "conflict") will necessarily commit in some globally consistent
order, which in the case of SS2PL is consistent with the
"must-have-happened-before-in-any-serialized-order" relation; using the
terminology from the literature: because SS2PL is CO.)

I don't know whether it is generally considered a requirement for a
2PC environment to guarantee global serializability.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ToDo List Item - System Table Index Clustering

2011-01-16 Thread Nicolas Barbier
2011/1/16 Simone Aiken sai...@ulfheim.net:

        is there a way to make queries on the system tables show me what
        is actually there when I'm poking around?  So for example:

                Select * from pg_type limit 1;

        tells me that the typoutput is 'boolout'.  An english string rather 
 than
        a number.  So even though the documentation says that column
        maps to pg_proc.oid I can't then write:

                Select * from pg_proc where oid = 'boolout';

The type of typoutput is regproc, which is really an oid with a
different output function. To get the numeric value, do:

Select typoutput::oid from pg_type limit 1;
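
To go the other way (which is what the original query attempted), the
name can be cast to regproc, or proname can be compared directly:

Select * from pg_proc where oid = 'boolout'::regproc;
-- or equivalently:
Select * from pg_proc where proname = 'boolout';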

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] join functions

2011-01-07 Thread Nicolas Barbier
2011/1/5 Zotov zo...@oe-it.ru:

 Why doesn`t work this query?
 select table1.field1, func1.field2 from table1 left outer join
 func1(table1.field1) on true where func1.field3 in (20, 100);

 If i have other than LEFT OUTER JOIN I can understand why
   ERROR:  invalid reference to FROM-clause entry for table table1

 but why here?
   for each row of table1 just evaluate func1(table1.field1)

That seems like a use case for LATERAL, which is not supported yet.

Some recent discussion seems to be
URL:http://archives.postgresql.org/pgsql-hackers/2009-09/msg00292.php.
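
For reference, a sketch of how the original query might look with such
a feature (hypothetical syntax; PostgreSQL does not support it at this
point):

select table1.field1, func1.field2
from table1
left outer join lateral func1(table1.field1) on true
where func1.field3 in (20, 100);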

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal : cross-column stats

2010-12-24 Thread Nicolas Barbier
2010/12/24 Florian Pflug f...@phlo.org:

 On Dec23, 2010, at 20:39 , Tomas Vondra wrote:

   I guess we could use the highest possible value (equal to the number
   of tuples) - according to wiki you need about 10 bits per element
   with 1% error, i.e. about 10MB of memory for each million of
   elements.

 Drat. I had expected these number to come out quite a bit lower than
 that, at least for a higher error target. But even with 10% false
 positive rate, it's still 4.5MB per 1e6 elements. Still too much to
 assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8
bits in a byte. 10 bits per element = 1.25MB per million elements.
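
A quick sanity check of that arithmetic (taking 1 MB to be 10^6 bytes):

SELECT 10 * 1000000 / 8 / 1000000.0 AS mb_per_million_elements;  -- 1.25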

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v20 (bitrot fixes)

2010-12-20 Thread Nicolas Barbier
2010/12/20 Martijn van Oosterhout klep...@svana.org:

 On Mon, Dec 20, 2010 at 09:03:56AM +0900, Itagaki Takahiro wrote:

 UTF-8 is not a superset of all encodings.

 I think you mean Unicode is not a superset of all character sets. I've
 heard this before but never found what's missing. [citation needed]?

From 
URL:http://en.wikipedia.org/wiki/Japanese_language_and_computers#Character_encodings:

Unicode is supposed to solve all encoding problems in all languages
of the world. [..] There are still controversies. For Japanese, the
kanji characters have been unified with Chinese, that is a character
considered to be the same in both Japanese and Chinese have been given
one and the same code number in Unicode, even if they look a little
different. This process, called Han unification, has caused
controversy.

For examples (my browser doesn't show any differences though, probably
because I don't have the corresponding fonts):

URL:http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v18 (merge against master, bitrot-only-fixes)

2010-12-16 Thread Nicolas Barbier
2010/12/16 Robert Haas robertmh...@gmail.com:

 On Thu, Dec 16, 2010 at 7:49 AM, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:

 Please note that the SQL scripts seem to be encoded in latin9.

 Seems like an odd choice.  Why not UTF-8?

Latin 9 = ISO 8859-15 = a more modern version of Latin 1 (ISO 8859-1),
which includes the € symbol for example. I.e., it's not that weird. As
long as there are no non-ASCII characters, it seems even likely that
some tools might detect a simple ASCII text file as Latin 9 by
default.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mvcc DML on the same row

2010-12-15 Thread Nicolas Barbier
2010/12/15 matteo durighetto desmodem...@gmail.com:

 But why do we need all these versions of the same row in the table, if
 for rollback we only need the original row X (X0)?

And the previous value of row X during the execution of a statement
(because statements don't see their own changes, think INSERT INTO a
SELECT * FROM a). And any values that we might need to ROLLBACK TO
SAVEPOINT to.
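
A small illustration of that INSERT INTO a SELECT * FROM a point (the
statement sees the table as it was when the statement started):

CREATE TABLE a (i int);
INSERT INTO a VALUES (1), (2);
INSERT INTO a SELECT * FROM a;  -- inserts exactly 2 rows, does not loop forever
SELECT count(*) FROM a;         -- 4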

 So I think we need it in memory, not in the physical space of the
 table (OK, there is the cache, but ...) or something similar

It must be possible to push out those changes from memory to disk
anyway, because there is no limit on how many rows a transaction can
update in PostgreSQL (vs. Oracle's "snapshot too old" problems). But
then, keeping the locally updated rows in some kind of special
per-transaction cache or in the global page cache isn't that
different.

Also, updating the same row many times in one transaction is probably
not regarded a very typical use case.

Note that other DBMSs may implement MVCC more along the lines you
specified; AFAIR, InnoDB uses such an approach. This may mean that
they don't need VACUUM.

I think that the consensus is that there is a trade-off between doing
VACUUM-like things synchronously, or having the possibility to do it
asynchronously at times when load is low. In PostgreSQL, the latter
was chosen.

Btw, this topic has been discussed at length in the past, please check
the archive.

 or is this method for transactions with isolation level READ
 UNCOMMITTED?

PostgreSQL implements READ UNCOMMITTED as READ COMMITTED (providing a
higher level of isolation than requested is allowed by the standard),
so that is definitely not the reason.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER TABLE ... ADD FOREIGN KEY ... NOT ENFORCED

2010-12-13 Thread Nicolas Barbier
2010/12/13 Tom Lane t...@sss.pgh.pa.us:

 But allow me to harbor doubts that they really
 intend to allow someone to force a constraint to be considered valid
 without any verification.

"Table constraints are either enforced or not enforced. Domain
constraints and assertions are always enforced." (subclause 4.17.2)

I don't read that as meaning that such unenforced constraints are
considered valid. It sounds as if unenforced constraints are the
same as non-existing constraints (think: constraint templates),
possibly as a means to remember that they should be re-enabled at
some point.

I.e., marking a constraint as unenforced and then as enforced again
would be a shortcut for removing and re-adding the constraint, while
having the advantage that one doesn't have to keep a list of
constraint definitions that must be re-added.
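
For example (hypothetical syntax and object names; PostgreSQL has
neither the NOT ENFORCED clause nor, at this point, ALTER CONSTRAINT):

-- Hypothetical standard-style shorthand:
--   ALTER TABLE b ALTER CONSTRAINT b_a_id_fkey NOT ENFORCED;
--   ... bulk load ...
--   ALTER TABLE b ALTER CONSTRAINT b_a_id_fkey ENFORCED;  -- presumably re-checks here
-- Today's equivalent:
ALTER TABLE b DROP CONSTRAINT b_a_id_fkey;
-- ... bulk load ...
ALTER TABLE b ADD CONSTRAINT b_a_id_fkey FOREIGN KEY (a_id) REFERENCES a (id);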

 (In particular, can a constraint
 go from not-enforced to enforced state without getting checked at that
 time?)

I assume not.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MULTISET and additional functions for ARRAY

2010-11-11 Thread Nicolas Barbier
2010/11/11 David E. Wheeler da...@kineticode.com:

 On Nov 11, 2010, at 10:05 AM, Tom Lane wrote:

 So are you planning to implement multisets? It's a feature I'd love to see

 What actual functionality does it buy?  AFAICT from Itagaki-san's
 description, it's an array only you ignore the specific element order.
 So what?  You can write functions that work that way now.

 Also, no dupes.

The multi in multiset indicates that duplicate elements are
explicitly allowed and tracked.
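
As a sketch of what that means in the standard's (SQL:2003) syntax --
not supported by PostgreSQL, and quoted from memory, so take the exact
spelling with a grain of salt:

SELECT CARDINALITY(MULTISET[1, 1, 2] MULTISET UNION ALL MULTISET[2]);
-- 4: both 1s are kept, and the result contains two 2s as well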

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash support for arrays

2010-11-03 Thread Nicolas Barbier
2010/11/2 Kenneth Marshall k...@rice.edu:

 Given that our hash implementation mixes the input data well (It does.
 I tested it.) then a simple rotate-and-xor method is all that should
 be needed to maintain all of the needed information. The original
 hash function has done the heavy lifting in this case.

Even with the perfect hash function for the elements, certain
combinations of elements could still lead to massive collisions. E.g.,
if repeated values are typical in the input data we are talking about,
then the rotate-and-xor method would still lead to collisions between
any array of the same values of certain lengths, regardless of the
value. In Tom's implementation, as he mentioned before, those
problematical lengths would be multiples of 32 (e.g., an array of 32
1s would collide with an array of 32 2s would collide with an array of
32 3s, etc).
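
To make the length-32 case concrete (a sketch, assuming a combine step
of the form result = rotate_left(result, 1) XOR hash(element) on 32-bit
values): for an array of 32 copies of an element whose hash is h, the
final value is rot(h,31) XOR rot(h,30) XOR ... XOR rot(h,0). Every bit
of that XOR equals the parity of the number of 1 bits in h, so the
combined hash is either all zeroes or all ones -- independent of which
element was repeated.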

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] client socket TIME_WAIT state after PQfinish

2010-09-27 Thread Nicolas Barbier
2010/9/27 Guillaume Du Pasquier guillaume.dupasqu...@sensometrix.ch:

 In both cases, the client socket (pgadmin or my program) remains in
 TIME_WAIT state.

 I have used wireshark to sniff the TCP protocol.

 We have at the end of a connection:

 Client      Server

    ---    FIN,ACK  ---
    ---    FIN,ACK  ---
    ---      ACK    ---

 This ends up in a TIME_WAIT state. The TCP protocol should be

According to the Two Generals' Problem [1], one of the sides
necessarily doesn't know whether the other side has received its last
packet. Therefore, TCP lets one of the sides sit in TIME_WAIT status
for as long as any packets could in principle survive on the network
(typically defined as 2 minutes on IP networks IIRC), and potentially
disturb a new connection between the same (dst IP, dst port, src IP,
src port) combination.

[1] URL:http://en.wikipedia.org/wiki/Two_Generals'_Problem

 Client      Server

    ---    FIN,ACK  ---
    ---      ACK    ---
    ---    FIN,ACK  ---

AFAIK, this last ACK (in the exchange above) is not needed: the server can
ACK the client's FIN _while_ it sends its own FIN (by using an
appropriate sequence number, as FIN uses one byte in the sequence).

    ---      ACK    ---

 I suppose there is a bug in the postgresql server that do not send an ack to
 the client.

I don't think so.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] client socket TIME_WAIT state after PQfinish

2010-09-27 Thread Nicolas Barbier
2010/9/27 Robert Haas robertmh...@gmail.com:

 On Mon, Sep 27, 2010 at 12:56 PM, Guillaume Du Pasquier
 guillaume.dupasqu...@sensometrix.ch wrote:

 Our client runs on the same machine as the postgresql server.
 Would it be possible to use PF_UNIX sockets ?

 Yeah, actually that's the default, if you just run psql with no
 parameters.  It looks for a socket in /tmp/.s.PGSQL.5432.

It depends on the interface; e.g., the JDBC driver doesn't support
unix sockets, AFAIR.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Serializable Snapshot Isolation

2010-09-25 Thread Nicolas Barbier
[ Forgot the list, resending. ]

2010/9/25 Greg Stark gsst...@mit.edu:

 On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:

 One place I'm particularly interested in using such a feature is in
 pg_dump. Without it we have the choice of using a SERIALIZABLE
 transaction, which might fail or cause failures (which doesn't seem
 good for a backup program) or using REPEATABLE READ (to get current
 snapshot isolation behavior), which might capture a view of the data
 which contains serialization anomalies.

 I'm puzzled how pg_dump could possibly have serialization anomalies.
 Snapshot isolation gives pg_dump a view of the database containing all
 modifications committed before it started and no modifications which
 committed after it started. Since pg_dump makes no database
 modifications itself it can always just be taken to occur
 instantaneously before any transaction which committed after it
 started.

I guess that Kevin is referring to [1], where the dump would take the
role of T3. That would mean that the dump itself must be aborted
because it read inconsistent data.

AFAICS, whether that reasoning means that a dump can produce an
inconsistent backup is debatable. After restoring, all transactions
that would have been in-flight at the moment the dump took its
snapshot are gone, so none of their effects happened. We would be in
exactly the same situation as if all running transactions would be
forcibly aborted at the moment that the dump would have started.

OTOH, if one would compare the backup with what really happened,
things may look inconsistent. The dump would show what T3 witnessed
(i.e., the current date is incremented and the receipts table is
empty), although the current state of the database system shows
otherwise (i.e., the current date is incremented and the receipts
table has an entry for the previous date).

IOW, one could say that the backup is consistent only if it were never
compared against the system as it continued running after the dump
took place.

This stuff will probably confuse the hell out of most DBAs :-).

Nicolas

[1] URL:http://archives.postgresql.org/pgsql-hackers/2010-05/msg01360.php

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Deadlock bug

2010-08-25 Thread Nicolas Barbier
2010/8/25 Simon Riggs si...@2ndquadrant.com:

 referenced meaning by an RI constraint, which only ever refers to
 PKs in other tables.

FK constraints can also point to non-PK UNIQUE columns.
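
For example (hypothetical tables):

CREATE TABLE parent (id int PRIMARY KEY, code text UNIQUE);
CREATE TABLE child (parent_code text REFERENCES parent (code)); -- FK to a non-PK UNIQUE column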

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Deadlock bug

2010-08-25 Thread Nicolas Barbier
2010/8/25 Simon Riggs si...@2ndquadrant.com:

 On Wed, 2010-08-25 at 16:14 +0200, Nicolas Barbier wrote:
 2010/8/25 Simon Riggs si...@2ndquadrant.com:

  referenced meaning by an RI constraint, which only ever refers to
  PKs in other tables.

 FK constraints can also point to non-PK UNIQUE columns.

 You're exactly correct and I now understand Markus' comment. Do you
 think that exact meaning prevents my proposal from being useful?

Not at all, because I guess that updates to non-UNIQUE columns are way
more common than updates to UNIQUE columns.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] About debug two versions of postgresql in eclipse

2010-08-24 Thread Nicolas Barbier
2010/8/24 Pei He hepeim...@gmail.com:

 I want to run two different versions of postgresql-8.2.5 under eclipse.

 But, it requires me to change PGDATA and LD_LIBRARY_PATH to switch.

 Moreover, to let eclipse know the changes, I need to modify .profile
 under my home folder, and log out and log in. (I am using Ubuntu.)

 Is there a way that I can debug two versions of postgresql without
 closing eclipse?

In each run configuration, in the Environment tab, add those
environment variables and give them values that match the PG version
that the run configuration targets.
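
For example (all paths purely hypothetical):

Run configuration "pg-8.2.5-A":
  PGDATA=/home/you/pg-8.2.5-A/data
  LD_LIBRARY_PATH=/home/you/pg-8.2.5-A/install/lib
Run configuration "pg-8.2.5-B":
  PGDATA=/home/you/pg-8.2.5-B/data
  LD_LIBRARY_PATH=/home/you/pg-8.2.5-B/install/lib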

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] assertions and constraint triggers

2010-08-11 Thread Nicolas Barbier
2010/8/11 Marko Tiikkaja marko.tiikk...@cs.helsinki.fi:

 On 8/11/10 8:31 AM +0300, Peter Eisentraut wrote:

 Thinking about SQL assertions (check constraints that are independent of
 one particular table), do you think it would be reasonable to implement
 those on top of constraint triggers?  On creation you'd hook up a
 trigger to each of the affected tables.  And the trigger function runs
 the respective check expression.  Conceptually, this doesn't seem to be
 very far away from foreign key constraints after all.

 I thought the point of ASSERTIONs was that you could write a thing such as:

 CREATE ASSERTION foo CHECK ((SELECT count(*) FROM tbl) = 4);

 Enforcing that kind of constraints without true serializability seems
 impractical.

Exactly what I thought when I read this. Without true serializability,
the view of the database at any moment during a transaction doesn't
have to be the same as the view that a newly started transaction gets.
Therefore, checking that the assertion holds after changing something
doesn't necessarily guarantee that it will hold for any other
transactions.

To elaborate on a variant of Marko's example, where the "=" is
replaced with "<=". Assume SERIALIZABLE transactions that are not truly
serializable:

* The table has 3 rows.
* T1 inserts a row, and concurrently, T2 also inserts a row; after
each statement, the assertion is not violated for the corresponding
transaction's snapshot.
* The assertion is now violated for a subsequent transaction T3
(because it sees 5 rows).
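
Spelled out as a schedule (table name and the limit are hypothetical;
the would-be assertion is count(*) <= 4, and the table starts out with
3 rows):

T1> BEGIN; INSERT INTO tbl DEFAULT VALUES;
T1> SELECT count(*) FROM tbl;  -- sees 4: OK for T1's snapshot
T2> BEGIN; INSERT INTO tbl DEFAULT VALUES;
T2> SELECT count(*) FROM tbl;  -- also sees 4 (T1 has not committed): OK for T2
T1> COMMIT;
T2> COMMIT;
T3> SELECT count(*) FROM tbl;  -- sees 5: the would-be assertion is violated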

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online backup cause boot failure, anyone know why?

2010-08-05 Thread Nicolas Barbier
2010/8/5 Richard husttrip...@vip.sina.com:

 All jods are done by client code, not manually.

What is a jod?

 I still did't not understand what you said.
 What break what?

The fact that you replaced CHECKPOINT_WAIT with CHECKPOINT_IMMEDIATE
is the cause of your problem. You broke the correctness of the
system by doing so.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC - code of implementation of materialized views

2010-06-30 Thread Nicolas Barbier
2010/6/30 Robert Haas robertmh...@gmail.com:

 By the way, does the SQL standard say anything about materialized views?

AFAIK, nope. Probably for the same reason that indexes are not
mentioned by the standard: both are only performance enhancements, and
one could easily imagine future SQL database systems that manage their
creation and removal automatically (based on usage patterns or
available disk space or somesuch).

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exposing the Xact commit order to the user

2010-05-25 Thread Nicolas Barbier
2010/5/25 Dan Ports d...@csail.mit.edu:

 On Mon, May 24, 2010 at 10:24:07AM -0500, Kevin Grittner wrote:

 Replicating or recreating the whole predicate locking and conflict
 detection on slaves is not feasible for performance reasons. (I
 won't elaborate unless someone feels that's not intuitively
 obvious.) The only sane way I can see to have a slave database allow
 serializable behavior is to WAL-log the acquisition of a snapshot by
 a serializable transaction, and the rollback or commit, on the
 master, and to have the serializable snapshot build on a slave
 exclude any serializable transactions for which there are still
 concurrent serializable transactions. Yes, that does mean WAL-
 logging the snapshot acquisition even if the transaction doesn't yet
 have an xid, and WAL-logging the commit or rollback even if it never
 acquires an xid.

 One important observation is that any anomaly that occurs on the slave
 can be resolved by aborting a local read-only transaction. This is a
 good thing, because the alternatives are too horrible to consider.

 You could possibly cut the costs of predicate locking by having the
 master ship with each transaction the list of predicate locks it
 acquired. But you'd still have to track locks for read-only
 transactions, so maybe that's not a significant cost improvement. On
 the other hand, if you're willing to pay the price of serializability
 on the master, why not the slaves too?

I don't understand the problem. According to me, in the context of
SSI, a read-only slave can just map SERIALIZABLE to the technical
implementation of REPEATABLE READ (i.e., the currently-existing
SERIALIZABLE). The union of the transactions on the master and the
slave(s) will still exhibit SERIALIZABLE behavior because the
transactions on the slave cannot write anything and are therefore
irrelevant.

Is anything wrong with that reasoning?

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exposing the Xact commit order to the user

2010-05-25 Thread Nicolas Barbier
2010/5/25 Dan Ports d...@csail.mit.edu:

 On Tue, May 25, 2010 at 02:00:42PM +0200, Nicolas Barbier wrote:

 I don't understand the problem. According to me, in the context of
 SSI, a read-only slave can just map SERIALIZABLE to the technical
 implementation of REPEATABLE READ (i.e., the currently-existing
 SERIALIZABLE). The union of the transactions on the master and the
 slave(s) will still exhibit SERIALIZABLE behavior because the
 transactions on the slave cannot write anything and are therefore
 irrelevant.

 This, unfortunately, isn't true in SSI.

 Consider read-only transactions on a single node SSI database -- the
 situation is the same for read-only transactions that run on a slave.
 These transactions can be part of anomalies, so they need to be checked
 for conflicts and potentially aborted.

 Consider Kevin's favorite example, where one table contains the current
 date and the other is a list of receipts (initially empty).
  T1 inserts (select current_date) into receipts, but doesn't commit
  T2 increments current_date and commits
  T3 reads both current_date and the receipt table
  T1 commits

 T3, which is a read-only transaction, sees the incremented date and an
 empty list of receipts. But T1 later commits a new entry in the
 receipts table with the old date. No serializable ordering allows this.
 However, if T3 hadn't performed its read, there'd be no problem; we'd
 just serialize T1 before T2 and no one would be the wiser.

 SSI would detect a potential conflict here, which we could resolve by
 aborting T3. (We could also abort T1, but if this is a replicated
 system this isn't always an option -- T3 might be running on the
 slave, so only the slave will know about the conflict, and it can't
 very well abort an update transaction on the master.)

Ah, indeed. I made the same reasoning mistake as Florian (presumably)
did: I didn't think of the fact that the read-only transaction doesn't
need to be the pivot.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exposing the Xact commit order to the user

2010-05-25 Thread Nicolas Barbier
2010/5/25 Florian Pflug f...@phlo.org:

 On May 25, 2010, at 20:18 , Dan Ports wrote:

 T3, which is a read-only transaction, sees the incremented date and an
 empty list of receipts. But T1 later commits a new entry in the
 receipts table with the old date. No serializable ordering allows this.

 However, if T3 hadn't performed its read, there'd be no problem; we'd
 just serialize T1 before T2 and no one would be the wiser.

 Hm, so in fact SSI sometimes allows the database to be inconsistent, but only 
 as long as nobody tries to observe it?

I would not call this an inconsistent state: it would become
inconsistent only after someone (e.g., T3) has observed it _and_ T1
commits.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exposing the Xact commit order to the user

2010-05-25 Thread Nicolas Barbier
2010/5/25 Florian Pflug f...@phlo.org:

 Hm, but for there to be an actual problem (and not a false positive), an
 actual dangerous circle has to exist in the dependency graph. The
 existence of a dangerous structure is just a necessary (but not
 sufficient) and easily checked-for condition for that, right? Now, if a
 read-only transaction only ever has outgoing edges, it cannot be part
 of a (dangerous or not) circle, and hence any dangerous structure it is
 part of is a false positive.

 I guess my line of reasoning is flawed somehow, but I cannot figure out why...

In the general case, wr dependencies also create "must be serialized
before" edges. It seems that those edges can be discarded when finding
a pivot, but if you want to go back to basics:

("<" means "must be serialized before".)

* T1 < T2, because T1 reads a version of a data element for which T2
later creates a newer version (rw between T1 and T2).
* T3 < T1, because T3 reads a version of a data element for which T1
later creates a newer version (rw between T3 and T1).
* T2 < T3, because T2 creates a version of a data element, which is
then read by T3 (wr between T2 and T3).

(As you can see, those 3 edges form a cycle.)

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Row-level Locks SERIALIZABLE transactions, postgres vs. Oracle

2010-05-14 Thread Nicolas Barbier
2010/5/14 Greg Stark gsst...@mit.edu:

 On Thu, May 13, 2010 at 10:25 PM, Florian Pflug f...@phlo.org wrote:

 C1: BEGIN
 C1: SELECT * FROM t WHERE id = 1 FOR UPDATE
 C2: BEGIN
 C2: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
 C2: SELECT * FROM t -- Take snapshot before C1 commits
 C1: COMMIT
 C2: DELETE FROM t WHERE id = 1
 C2: COMMIT

 Can you give an actual realistic example -- ie, not doing a select for
 update and then never updating the row or with an explanation of what
 the programmer is attempting to accomplish with such an unusual
 sequence? The rest of the post talks about FKs but I don't see any
 here...

The link with FKs is as follows:

* The example does not use a real FK, because the whole purpose is to
do the same as FKs while not using the FK machinery.
* The example uses only one table, because that is enough to
illustrate the problem (see next items).
* C1 locks a row, supposedly because it wants to create a reference to
it in a non-mentioned table, and wants to prevent the row from being
deleted under it.
* C2 deletes that row (supposedly after it verified that there are no
references to it; it would indeed not be able to see the reference
that C1 created/would create), and C1 fails to detect that.
* C2 also fails to detect the problem, because the lock that C1 held
is being released after C1 commits, and C2 can happily go on deleting
the row.
* The end result is that the hypothetical reference is created,
although the referent is gone.
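
Spelling that out as a schedule (the referencing table ref(t_id) is
hypothetical; it plays the part of the non-mentioned table):

C1> BEGIN;
C1> SELECT * FROM t WHERE id = 1 FOR UPDATE;  -- the referent exists; "protect" it
C2> BEGIN; SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
C2> SELECT * FROM t;                          -- snapshot taken before C1 commits
C1> INSERT INTO ref VALUES (1);               -- create the reference
C1> COMMIT;                                   -- the row lock is released
C2> SELECT * FROM ref WHERE t_id = 1;         -- empty: C2's snapshot predates the insert
C2> DELETE FROM t WHERE id = 1;               -- succeeds, no serialization failure
C2> COMMIT;                                   -- ref now references a row that is gone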

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning/inherited tables vs FKs

2010-05-11 Thread Nicolas Barbier
2010/5/11 Marko Tiikkaja marko.tiikk...@cs.helsinki.fi:

 On 2010-05-11 14:29 +0200, Robert Haas wrote:

 On Tue, May 11, 2010 at 2:16 AM, Dmitry Fefelov fo...@ac-sw.com wrote:

 The referential integrity triggers contain some extra magic that isn't
 easily simulatable in userland, and that is necessary to make the
 foreign key constraints airtight.  We've discussed this previously but
 I don't remember which thread it was or the details of when things
 blow up.  I think it's something like this: the parent has a tuple
 that is not referenced by any child.  Transaction 1 begins, deletes
 the parent tuple (checking that it has no children), and pauses.
 Transaction 2 begins, adds a child tuple that references the parent
 tuple (checking that the parent exists, which it does), and commits.
 Transaction 1 commits.

 Will SELECT ... FOR SHARE not help?

 Try it, with the example above.  I think you'll find that it doesn't.

 TXA=> delete from foo;
 DELETE 1

 TXB=> select a from foo for share; -- waits

 What am I missing?

Slightly verbose example of what can go wrong:

CREATE TABLE a (i int PRIMARY KEY);
INSERT INTO a VALUES (1);

CREATE TABLE b (a_id int);

Start with T1:

T1> BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN
T1> SELECT i FROM a WHERE i = 1 FOR SHARE; -- Does a with i = 1 exist?
 i
---
 1
(1 row)

T1> INSERT INTO b VALUES (1); -- Great, it existed, insert row
pointing to it in b.
INSERT 0 1

Switch to T2:

T2> BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE; -- Evil
transaction T2 is intervening!
BEGIN
T2> SELECT i FROM a WHERE i = 1 FOR SHARE; -- Lock a with i = 1 FOR SHARE.
 i
---
 1
(1 row)

T2> SELECT a_id FROM b WHERE a_id = 1; -- Check whether it's got
anything pointing to it.
 a_id
------
(0 rows)

T2> DELETE FROM a WHERE i = 1; -- Nope, so delete a with i = 1 (this
blocks, because T1 is still holding the lock).

Switch to T1:

T1> COMMIT; -- Commit the insertion of a row pointing to a with i = 1
(this releases all locks that T1 is holding).
COMMIT

T2 continues:

DELETE 1
T2> COMMIT; -- Commit the deletion of a with i = 1.
COMMIT
T2> SELECT * FROM b EXCEPT SELECT * FROM a; -- Check for inconsistencies.
 a_id
------
    1
(1 row)

Woops.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning/inherited tables vs FKs

2010-05-11 Thread Nicolas Barbier
2010/5/11 Marko Tiikkaja marko.tiikk...@cs.helsinki.fi:

 This is getting way off topic, but:

 On 5/11/10 3:55 PM +0300, Nicolas Barbier wrote:

 T2> SELECT i FROM a WHERE i = 1 FOR SHARE; -- Lock a with i = 1 FOR
 SHARE.
  i
 ---
  1
 (1 row)

 T2> SELECT a_id FROM b WHERE a_id = 1; -- Check whether it's got
 anything pointing to it.
  a_id
 ------
 (0 rows)

 T2> DELETE FROM a WHERE i = 1; -- Nope, so delete a with i = 1 (this
 blocks, because T1 is still holding the lock).

 Obviously you wouldn't delete anything with a SHARE lock.

So where would you put a SELECT ... FOR SHARE to fix the problem? (Per
Will SELECT ... FOR SHARE not help?.) I agree that my second FOR
SHARE doesn't really make a lot of sense, but that doesn't disprove
the fact that the first FOR SHARE fails to ensure consistency.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] system table/view and sequence

2010-04-07 Thread Nicolas Barbier
2010/4/7 Olivier Baheux olivierbah...@gmail.com:

 I'm trying to find where sequence definitions
 (increment, minvalue, maxvalue, start, cache) are stored in the system
 tables. At the moment I have found everything except sequences.

It's in the sequence itself (which can be accessed like a table). The
fact that this "table" is in fact a sequence is stored in pg_class:

8<------------------
itsme=# CREATE TABLE a (i serial);
NOTICE:  CREATE TABLE will create implicit sequence "a_i_seq" for
serial column "a.i"
CREATE TABLE
itsme=# SELECT * FROM a_i_seq;
 sequence_name | last_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called
---------------+------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
 a_i_seq       |          1 |            1 | 9223372036854775807 |         1 |           1 |       1 | f         | f
(1 row)

itsme=# SELECT relkind FROM pg_class WHERE relname = 'a_i_seq';
 relkind
---------
 S
(1 row)
8<------------------

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: preloading of ispell dictionary

2010-03-23 Thread Nicolas Barbier
2010/3/23 Pavel Stehule pavel.steh...@gmail.com:

 2010/3/23 Takahiro Itagaki itagaki.takah...@oss.ntt.co.jp:

 The fundamental issue seems to be in the slow initialization of
 dictionaries. If so, how about adding a pre-complile tool to convert
 a dictionary into a binary file, and each backend simply mmap it?

 It means loading about 25MB from disk for every first tsearch query -
 sorry, I don't believe that can be good.

The operating system's VM subsystem should make that a non-problem.
"Loading" is also not the word I would use to indicate what mmap does.

Nicolas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

