Hardlinks and junctions don't work across physical disks; only symlinks do.
Where did you read this? I just looked and can see no such restriction.
There is no such restriction for junctions, I just tried it to be safe.
Andreas
---(end of
Have you tried using cc_r for that compile line? Does that help?
Alas, that option is not available.
cc_r is specific to the AIX xlc compiler; we're using GCC, and xlc
is not available to us.
What is missing is a -lpthread .
Andreas
I think adding 'S' to \df confuses more than it helps.
Why that? Imho it would be consistent.
Andreas
Well, it's easily changed, if all that's needed is a search-and-replace.
Suggestions for a better name?
MINGW32
I think that is a bad idea. That symbol strongly suggests that you are using MinGW.
Are you expecting someone who creates a VisualStudio project to define
MINGW32 ?
Andreas
Personally I don't think that any rename()-usleep loop is necessary.
I'll check the archives.
I agree the rename loop seems unnecessary. I kept it in case we hadn't
dealt with all the failure places. Should we remove them now or wait
for 8.1? Seems we should keep them in and see if we
PS: but something you *could* do in 8.0 is replace cp by gzip to
archive compressed files that way.
How about replacing the page image records with a same-size dummy record
that only contains a dummy header and all 0's? If the archiver cares about
space he will use some sort of compression
Think harder... one processor != one process...
Well sure, but you don't want a spinlock in that case.
Actually you do, when the normal case is that you don't have to block.
You want it to fall through as quickly as possible in the success case
(the blocking case is going to suck no
Re-thinking the whole purpose of the additional full page images appended to
the xlog records, I now understand and agree with Tom's comment in the docs
that we don't need to include those additional full page images for PITR -
they only exist to correctly recover the database in the event of
I was wondering about this point - might it not be just as reasonable
for the copied file to *be* an exact image of pg_control? Then a very
simple variant of pg_controldata (or maybe even just adding switches to
pg_controldata itself) would enable the relevant info to be extracted
I was just looking around the net to see exactly what Oracle's PL/SQL
syntax is. It doesn't seem too unreasonable syntax-wise:
BEGIN
... controlled statements ...
EXCEPTION
WHEN exception_name THEN
... error handling
Also, since I checked and it seems that our syntax for putting tables and
indexes in tablespaces at creation time is identical to Oracle's,
perhaps we should copy them on constraints as well.
Since we're getting close to beta, can we have consensus on what I'm to
do about this?
The
I don't know if the problem is isolated to just me, but I wanted to
suggest that we use a parameter for that, which can be configured in the
postgresql.conf, with a default value if it's not set, set to
localhost.
I think you should first trace down what the problem really is --- is
Hang on, are you supposed to MOVE or COPY away WAL segments?
Copy. pg will delete them once they are archived.
Copy. pg will recycle them once they are archived.
Andreas
If localhost isn't being resolved correctly are you seeing error
messages like this on the server log (from pqcomm.c)?
could not translate host name "%s", service "%s" to address: %s
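A quick way to check whether libc can resolve localhost at all is a direct getaddrinfo() call, along the lines of what pqcomm.c does. This is just a sketch (the port 5432 is an arbitrary choice):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo hints, *res;
    int rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    rc = getaddrinfo("localhost", "5432", &hints, &res);
    if (rc != 0)
    {
        /* this mirrors the server log message from pqcomm.c */
        fprintf(stderr, "could not translate host name \"localhost\": %s\n",
                gai_strerror(rc));
        return 1;
    }
    printf("resolved\n");
    freeaddrinfo(res);
    return 0;
}
```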
After poking around, I found the following:
The Windows pdc (==name server :-( ) does really not
I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.
As you can see, we are still chewing on NT. What PITR features are
missing? I
then on restore once all the files are restored move the
pg_control.backup to its original name. That gives us the checkpoint
wal/offset, but how do we get the start/stop information? Is that not
required?
The checkpoint wal/offset is in pg_control, that is sufficient start
information.
Do we need a checkpoint after the archiving
starts but before the backup begins?
No.
Actually yes.
Sorry, I incorrectly failed to connect 'archiving' with the backed-up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a
Other db's have commands for:
start/end external backup
I see that the analogy to external backup was not good, since you are correct
that dba's would expect that to stop all writes, so they can safely split
their mirror or some such. Usually the expected time from start
until end external
Sorry for the stupid question, but how do I get this patch if I do not
receive the patch mails?
The web interface HTML-ifies it, thus making it unusable.
Thanks
Andreas
The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts
rollforward
Unfortunately this only works if pg_control was the first file to be
backed up (or by
My answers:
Q1: Should Portals successfully created within the failed subxact
be closed? Or should they remain open?
no for protocol level
I can understand a yes to this one for sql level, because it will be
hard to clean up by hand :-( But I like the analogy to hold cursors,
so I would
The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are
My proposal would be:
1. Begin main transaction: BEGIN { TRANSACTION | WORK }
2. Commit main (all) transaction: COMMIT { TRANSACTION | WORK }
3. Rollback main (all) transaction: ROLLBACK { TRANSACTION }
4. Begin inner transaction: BEGIN NESTED { TRANSACTION | WORK }
5. Commit inner
As far as implementing only savepoints, look at this:
BEGIN;
BEGIN;
INSERT INTO ...;
COMMIT;
BEGIN;
INSERT INTO ...;
COMMIT;
BEGIN;
INSERT INTO ...;
COMMIT;
With savepoints, it looks pretty strange:
BEGIN;
I'd opt for BEGIN as a start of a subtransaction (no need for special
semantics in plpgsql), the corresponding END simply changes the
transaction context to the parent level.
But 'BEGIN' in plpgsql does not start a [sub]transaction, it starts a
statement block. Are we intending to change that ? I think not.
There are two possibilities:
Either BEGIN *does* start a subtransaction, or BEGIN does not. I don't
see how two nesting level hierarchies in a
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope)
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
Imho
Well, the proposal of implementing it like holdable cursors means using
a Materialize node which, if I understand correctly, means taking the
whole result set and storing it on memory (or disk).
Would it help to hold the lock for a record that is the current cursor position,
iff this record
begin;
declare cursor c ...;
fetch 1 from c; -- returns tuple 1
begin;
fetch 1 from c; -- returns tuple 2
rollback;
fetch 1 from c; -- returns tuple 1 again
This is mightly ugly but I think it's the most usable of the options
seen so far.
My only guess is that getaddrinfo in your libc has a bug somehow that is
corrupting the stack (hence the improper backtrace), then crashing.
It could be libc on AIX, I suppose, but it strikes me as sort of odd
that nobody else ever sees this. Unless nobody else is using AIX
5.1, which
With the rule system and two underlying tables one could make it work by
hand I think.
The rule system could be used to do this, but there was some discussion of
using inherited tables to handle it. However neither handles the really hard
part of detecting queries that use only a part
We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.
I think we can safely put that on the todo list.
The
I think an actually implementable version of this would be:
1. Don't log any index operations at all in WAL.
2. When recovering from WAL, restore all the table contents by WAL
replay. (This would of course include the system catalog contents that
describe the indexes.) Then sit there
What would be useful for that (as of 7.5) is a dummy ALTER TABLE TYPE,
viz
ALTER TABLE tab ALTER COLUMN col TYPE same-type-it-already-has
I think we should add special syntax for this purpose, since I (or someone
else later on) would like to see all possible cases of alter column
LOG: database system was shut down at 2004-05-25 15:15:44 GMT-12
For comparison, 7.4.1 on the same system says:
LOG: database system was shut down at 2004-05-25 16:03:43 NZST
Can we keep the zic database convention unchanged but change the display
format in the logs to be
If you run NTFS, it's still possible to use arbitrary links. In the Windows
world, they are called junctions. Microsoft does not provide a junction tool
for some reason (perhaps because it's limited to NTFS). A good tool, free
and with source, can be found here
It is too late to think about pushing back another month. We had this
discussion already. June 1 is it.
I thought the outcome of that discussion was June 15?
Can we try to do the 2PC patch now instead of waiting for subtransactions?
Andreas
BTW, what are your plans for state saving/reversion for the lock manager
and buffer manager? The lock state, in particular, makes these other
problems look trivial by comparison.
Why can't we keep all locks until main tx end? Locks are not self-conflicting,
are they? So the only reason to
I know, this sucks, but, I don't see any other way, other than linking
*ALL* libpq-using programs (including initdb and friends) with -K
pthread.
How about making a libpq.so (without pthread) and a thread-safe
libpq_r.so?
Andreas
I think this argument is largely a red herring ... but if it makes you
feel better, we could change the contents of the commit timestamp to
be gettimeofday() output (seconds+microseconds) instead of just time()
output. That should be precise enough for practical purposes.
I am saying
FireBird: ALTER COLUMN column TYPE type
DB2:      ALTER COLUMN column SET DATA TYPE type
Oracle:   MODIFY column type
MSSQL:    ALTER COLUMN column type constraints
MySQL:    Both Oracle and MSSQL
SAP:      MODIFY column type
Spec:     Nothing (obvious) on changing column types
* Is it really a good idea for database-wide ANALYZE to run as a single
transaction? Holding all those locks is a recipe for deadlocks, even
if they're as inoffensive as AccessShareLocks normally are.
Wasn't one idea behind that change also to not make the planner create a plan
from mixed
Basically it is updating the logs as soon as it receives the
notifications. Writing 16 MB of xlogs could take some time.
In my experience with archiving logs, 16 MB is on the contrary way too
small for a single log. The overhead of starting e.g. a tape session
is so high that you cannot
Why do you think it useful to preload something during InitPostgres,
anyway? Any heavily used table will certainly be present in shared
buffers already, and even more surely present in kernel buffers.
And if you really want it preloaded you can issue dummy selects
with a client right after
Consider either a box with many different postgresql instances, or one
that runs both postgresql and other software. Issuing sync() in that
situation will cause sync of a lot of data that probably doesn't need
syncing.
But it'd probably be a very good thing on a dedicated server, giving the
The only way we can support file-level hot backup is in conjunction with
PITR-style WAL log archiving. It is okay for the data area dump to be
inconsistent, so long as your recovery process includes replay of WAL
starting at some checkpoint before the filesystem dump started, and
extending
To clarify:
I'd expect a cluster to be workable, if I
- disable VACUUM until backup completed
- issue CHECKPOINT
- backup clog (CHECKPOINT and backup clog are the backup checkpoint)
- backup all datafiles (which include at least all completed transaction
data at checkpoint time)
First of all, symlinks are a pretty popular feature. Even Windows
supports what would be needed. Second of all, PostgreSQL will still
run on OSes without symlinks, tablespaces won't be available, but
PostgreSQL will still run. Since we are all using PostgreSQL without
My idea for
- Dispose of connection names and replace them with a pointer.
You cannot dispose the names, you can only add something to also allow pointers.
The names are in the ESQL/C standard.
Andreas
For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.
My feeling is that we need not support tablespaces on OS's without
symlinks.
To create symlinked directories
Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a tablespace until the directory is empty.
Agreed.
How would it get to be empty? Are you thinking of some sort of connect
database to tablespace and disconnect database from tablespace
commands that would
I'm not sure I understand you correctly. The SQL standard says you can
call your statement as this:
exec sql at CONNECTION select 1;
Here CONNECTION of course is a string, the name of the connection. So,
yes, we have to maintain that list to make sure we get the right
connection.
I
I am asking for CONNECTION being a variable of data type 'connection *' rather
than 'const char *'. That would avoid name lookups.
Is that out of spec?
Yes, but the preprocessor could still add an optimization ala 'connection *' for
the hardcoded cases (exec sql set connection 'myconn1';
I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment.
Yes, and don't forget that pg also relies on the OS for grouping
I believe the ODBC driver uses CTID for this sort of problem. CTID is
guaranteed to exist and to be fast to access (since it's a physical
locator). Against this you have the problem that concurrent updates
of the record will move it, leaving your CTID invalid. However, that
IIRC the ctid
The question is whether we should have a GUC variable to control no
waiting on locks or add NO WAIT to specific SQL commands.
Does anyone want to vote _against_ the GUC idea for nowait locking. (We
already have two voting for such a variable.)
I vote against. We got bit by both the
I personally think a wait period in seconds would be more useful.
Millisecond timeouts tend to be misused with way too low values
in this case, imho.
I understand, but GUC lost the vote. I have updated the TODO list to
indicate this. Tatsuo posted a patch to add NO WAIT to the LOCK
In both cases, the transaction either commits or rolls back. No
other option is possible at the end of the transaction, but in the first
style of transaction semantics you get a mid-way decision point. This
only refers to retryable errors, since errors like access rights
violations and
It seems to me, that leaving all this to the client (which implicitly
inserts savepoints) can never be as efficient as a serverside feature.
I think this is an overly narrow view of efficiency. With client
control, the client can insert savepoints wherever it needs them,
Yes, but not
Improving on not ideal would be good, and would get even closer to
full Oracle/SQLServer migration/compatibility. However, since I've never
looked at that section of code, I couldn't comment on any particular
approach nor implement such a change, so I'll shut up and be patient.
Imagine
I don't think the bgwriter is going to be able to keep up with I/O bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes.
As long as the bgwriter does not do sync writes (which it does not,
since that would need a whole lot
People keep saying that the bgwriter mustn't write pages synchronously
because it'd be bad for performance, but I think that analysis is
faulty. Performance of what --- the bgwriter? Nonsense, the *point*
Imho that depends on the workload. For a normal OLTP workload this is
certainly
How about the typical answer on Windows? Create an invisible window
with an event handler and pass it a Windows message?
Andreas
Does anyone know how Informix, where this file comes from,
handles this?
Informix puts those files in $INFORMIXDIR/incl/esql (e.g. /usr/informix/incl/esql),
so imho a /usr/postgres installation could have them somewhere under /usr/postgres
Andreas
Ideally that path isn't taken very often. But I'm currently having a
discussion off-list with a CMU student who seems to be seeing a case
where it happens a lot. (She reports that both WALWriteLock and
WALInsertLock are causes of a lot of process blockages, which seems to
mean that a lot
Running the attached test program shows on BSD/OS 4.3:
write                 0.000360
write fsync           0.001391
(I think the write fsync pays for the previous write test, same filename.)
write, close fsync    0.001308
open o_fsync, write   0.000924
I have
Q2: New situation: why is it not a good idea to back up the database
files of a cluster incl. all c_log and x_log (log files last) to get a
physically hot backup?
In principle it is the same situation as a server which is crashing
(not at once but over some time). After restoring, it
There are no such libraries. I keep hearing ICU, but that is much too
bloated.
At least it is kind of standard and also something that will be
maintained for the foreseeable future; it also has a compatible license and
is available on all platforms of interest to postgresql.
And it is used
Have you looked at what is available from
http://oss.software.ibm.com/icu/ ?
Seems they have a compatible license, but use some C++.
Andreas
In case of WAL per database, the operations done on a shared catalog from a
backend would need flushing system WAL and database WAL to ensure such
transaction commit. Otherwise only flushing database WAL would do.
I don't think that is a good idea. If you want databases separated you should
The main things partitioning is useful for:
- partition elimination for queries (e.g. seq scans only scan relevant partitions)
- deleting/detaching huge parts of a table in seconds
- attaching huge parts to a table in seconds (that may have been loaded with
a fast loading utility (e.g. loading
1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if
Without grouping WAL writes that does not fly. Iff however such grouping
is implemented that should deliver optimal performance. I don't think flushing
WAL to the OS early (before a tx commits) is necessary, since writing 8k or
If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer;
that will be slower.
You can forget O_SYNC for datafiles for now. There would simply be too
LOG: could not bind socket for statistics collector: Cannot assign requested
address
Hmm ... that's sure the problem, but what can we do about it? ISTM that
any non-broken system ought to be able to resolve localhost. Actually
it's worse than that: your system resolved localhost and
that works well enough to make it uncommon for backends to have to
write dirty buffers for themselves. If we can, then doing all the
writes O_SYNC would not be a problem.
One problem with O_SYNC would be that the OS does not group writes any
more. So the code would need to either do its own sorting and grouping
(256k) or use aio, or you won't be able to get the maximum out of the disks.
Or just run multiple writer processes, which I believe is
The only idea I have come up with is to move all buffer write operations
into a background writer process, which could easily keep track of
every file it's written into since the last checkpoint.
I fear this approach. It seems to limit a lot of design flexibility later. But
I can't
My plan is to create another background process very similar to
the checkpointer and to let that run forever basically looping over that
BufferSync() with a bool telling that it's the bg_writer.
Why not use the checkpointer itself in between checkpoints?
Use a min and a max dirty setting like Informix: start writing
when more than max are dirty, stop when at min. This avoids writing
single pages (which is slow, since it cannot be grouped together
by the OS).
Current approach
Or... It seems to me that we have been observing something on the order
of 10x-20x slowdown for vacuuming a table. I think this is WAY
overcompensating for the original problems, and would cause its own
problem as mentioned above. Since the granularity of delay seems to be
the
I was wondering whether we need to keep WAL online for 2PC,
or whether only something like clog is sufficient.
What if:
1. phase 1 commit must pass the slave xid that will be used for 2nd phase
(it needs to return some sort of identification anyway)
2. the coordinator
Why would you spent time on implementing a mechanism whose ultimate
benefit is supposed to be increasing reliability and performance, when you
already realize that it will have to lock up at the slightest sight of
trouble? There are better mechanisms out there that you can use instead.
Using wb for writing out on Windows is so that we don't
get Windows' gratuitous addition of carriage returns. I will document that.
Please use the #define PG_BINARY_W from c.h which is defined
with the correct letters for all platforms (wb on Windows).
That is how Peter's comment was
The simplest scenario (though there could be variations) is:
[At participant (master)'s side]
Because the commit operation is done, it does nothing.
[At coordinator (slave)'s side]
1) After a while
2) re-establish the communication path between the
I don't think there is any way to handle cases where the master or slave
just disappears. The other machine isn't under the server's control, so
it has no way of it knowing. I think we have to allow the administrator
to set a timeout, or ask to wait indefinitely, and allow them to call an
Master                    Slave
------                    -----
commit ready  -------->
              <--------   OK
commit done   ----XX
Is the commit done message needed?
Of course ... how else will the Slave commit? From my
understanding, the concept is that the
Or the slave could reject the request.
Huh? The slave has that option?? In what circumstance?
I thought the slave could reject if someone local already had the row
locked.
No, not at all. The slave would need to reject phase 1 commit ready
for this.
Andreas
When the address-of operator is applied to a thread-local variable, it
is evaluated at run-time and returns the address of the current thread's
instance of that variable. An address so obtained may be used by any
thread. When a thread terminates, any pointers to thread-local variables
in
Are those response times in the right unit? 7-10s?
No problem: http://developer.osdl.org/markw/misc/plana.out
Ok, I guess I misunderstood you. These queries are taking 0.5ms - 300ms except
for the last aggregate query which takes just over 1s.
Yes, but because this is a benchmark he
I get the following errors
gmake -C ecpglib all
gmake[4]: Entering directory
`/usr/local/postgres/pgsql/src/interfaces/ecpg/ecpglib'
../../../../src/backend/port/aix/mkldexport.sh libecpg.a libecpg.exp
gcc -O2 -pipe -Wall -Wmissing-prototypes -Wmissing-declarations -Wl,-bnoentry -
Below is the email that prompted me to add the derived files to
WIN32_DEV CVS.
However, most people don't want them in there, so I have removed them,
and updated the web page to recommend the nightly snapshots (which have
the derived files), and mentioned the tools that will be needed
I'm confused. Right on the MinGW download page is a link for bison-1.875.
Yep, but I had problems with it. The author confirmed that there could be some
problems creating processes (for example M4). However, if you make it work,
I'll be interested to know how. Check the MinGW mailing list
From our previous discussion of 2-phase commit, there was concern that
the failure modes of 2-phase commit were not solvable. However, I think
multi-master replication is going to have similar non-solvable failure
modes, yet people still want multi-master replication.
No. The real
I don't think so, because the patch does nothing to keep the sort
order once the index is initially created.
As Tom mentioned, we might not want to keep the tid's in order after the
index is created because he wants the most recent tid's first, so the
expired ones migrate to the end.
In both cases ANALYZE will calculate correlation 1.0 for column X,
and something near zero for column Y. We would like to come out with
index correlation 1.0 for the left-hand case and something much less
(but, perhaps, not zero) for the right-hand case. I don't really see
a way to do this
Also, per other discussions, we are removing backend autocommit support
in 7.4. It was the wrong way to do it.
Somehow I did not see that conclusion made.
I thought, at least for JDBC, it is already successfully used?
I think the backend autocommit is useful. Maybe only the
In fact, I had proposed a simpler UNDO capability that revisited tuples
and set their XID to a fixed aborted XID to clean up aborted
subtransactions, but most now like the multiple XID solution.
I think for the implicit subtransactions that we will want
(with error codes coming) using a
for that, we get what exactly? Fetching one row at a time is
*guaranteed* to be inefficient. The correct response if that bothers
you is to fetch multiple rows at a time, not to make a less robust
protocol.
I don't feel strongly either way on this one, but IIRC the SQL standard
for