Re: [HACKERS] Socket problem using beta2 on Windows-XP
> IIRC, the win32 installer will enable autovacuum by default. And yes, autovacuum was my first thought as well after Thomas's last mail - that would be a good explanation of why it happens when the postmaster is idle.
>
> I used the win32 installer defaults, so autovacuum is probably a safe assumption.

Right. Please try turning it off and see if the problem goes away.

//Magnus
Re: [HACKERS] Socket problem using beta2 on Windows-XP
Magnus Hagander wrote:
> Right. Please try turning it off and see if the problem goes away.

It does (go away).

- thomas
Re: [HACKERS] Socket problem using beta2 on Windows-XP
Magnus Hagander wrote:
> Right. Please try turning it off and see if the problem goes away.

No, wait! It does *not* go away. Do I need to do anything more than setting this in my postgresql.conf file:

    autovacuum = false    # enable autovacuum subprocess?

and restarting the service? The two zombie entries occur directly when I start the service; then two new entries pop up every minute.

- thomas
Re: [HACKERS] Socket problem using beta2 on Windows-XP
> > Right. Please try turning it off and see if the problem goes away.
>
> No, wait! It does *not* go away. Do I need to do anything more than setting this in my postgresql.conf file:
>
>     autovacuum = false    # enable autovacuum subprocess?
>
> and restarting the service? The two zombie entries occur directly when I start the service; then two new entries pop up every minute.

Yes, that should be enough. Hmm. Weird! If you can get a backtrace from the point where the error msg shows up, that would certainly help - this means it's not coming from where we thought it was coming from :-(

//Magnus
[HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed from OK to Contrib failure
Looks like there's more unportable code in the recent changes to pgbench :-(

Regards, Dave.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: 30 September 2005 02:17
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: PGBuildfarm member snake Branch HEAD Status changed from OK to Contrib failure

The PGBuildfarm member snake had the following event on branch HEAD:

Status changed from OK to Contrib failure

The snapshot timestamp for the build that triggered this notification is: 2005-09-30 01:00:01

The specs of this machine are:
OS: Windows / Server 2003 SP1
Arch: i686
Comp: gcc / 3.4.2

For more information, see http://www.pgbuildfarm.org/cgi-bin/show_history.pl?nm=snake&br=HEAD
Re: [HACKERS] Socket problem using beta2 on Windows-XP
On Fri, Sep 30, 2005 at 08:29:07AM +0200, Thomas Hallgren wrote:
> Magnus Hagander wrote:
> > Right. Please try turning it off and see if the problem goes away.
>
> No, wait! It does *not* go away. Do I need to do anything more than setting autovacuum = false in my postgresql.conf file and restarting the service? The two zombie entries occur directly when I start the service; then two new entries pop up every minute.

If it's two zombies per minute, then I bet it's the stat collector and stat bufferer. They are restarted by the postmaster if not found to be running. The weird thing is that the postmaster _should_ call wait() for them if it detects that they died (when receiving a SIGCHLD signal, AFAIR). If it doesn't, maybe it indicates there's a problem with the signal handling on Win32.

-- Alvaro Herrera                    Valdivia, Chile   ICBM: S 39º 49' 17.7", W 73º 14' 26.8"
"We are who we choose to be", sang the goldfinch when the sun is high (Sandman)
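For reference, the wait()-on-SIGCHLD reaping Alvaro describes is the standard Unix idiom; a minimal standalone sketch in generic POSIX C (not the postmaster's actual handler) looks like this:

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Reap every exited child so none lingers as a zombie. */
    static void
    reap_children(int signo)
    {
        int save_errno = errno;

        (void) signo;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                   /* collect all dead children */
        errno = save_errno;     /* don't clobber errno in the main line */
    }

    int
    main(void)
    {
        signal(SIGCHLD, reap_children);

        if (fork() == 0)
            _exit(0);           /* child exits immediately */

        sleep(1);               /* handler fires; no zombie remains */
        puts("child reaped");
        return 0;
    }

If this pattern is broken on the Win32 port, dead stats children would pile up exactly as Thomas observes.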
Re: [HACKERS] Install pg_regress script to support PGXS?
Dear Tom,

> While testing the recent pgxs patch, I noticed that you can build and install contrib with PGXS: ... It seems that it ought to work to run installcheck too: gmake USE_PGXS=1 installcheck but this does not quite work because the pg_regress script isn't included in the installation tree. (If you copy it to where it'd need to be, installcheck works.) Is it worth including pg_regress in the installation to make this work? Seems like it might be handy for external modules to be able to run self-tests.

Indeed, I noticed this when I did pgxs ;-) It is on my todo list for postgresql, but I haven't had much time this year to contribute anything.

The reason it was not included at first is that it seemed to me that self-tests require a temporary installation, which makes easy sense when building from sources, but I was not sure about what would really be needed for the feature outside of the compilation tree (special conf files, diff/comparison commands...). I was planning to investigate the details, but if the pg_regress command is enough, it was a little bit stupid of me not to do it directly.

ISTM that the pg_regress command is not compiled by default, only under make test, so it should be added to the default compilation?

-- Fabien.
Re: [HACKERS] [PERFORM] A Better External Sort?
> From: Pailloncy Jean-Gerard [EMAIL PROTECTED]
> Sent: Sep 29, 2005 7:11 AM
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> > > Jeff Baker: Your main example seems to focus on a large table where a key column has constrained values. This case is interesting in proportion to the number of possible values. If I have billions of rows, each having one of only two values, I can think of a trivial and very fast method of returning the table sorted by that key: make two sequential passes, returning the first value on the first pass and the second value on the second pass. This will be faster than the method you propose.
> >
> > Ron Peacetree:
> > 1= No, that was not my main example. It was the simplest example, used to frame the later, more complicated examples. Please don't get hung up on it.
> > 2= You are incorrect. Since IO is the most expensive operation we can do, any method that makes two passes through the data at top scanning speed will take at least 2x as long as any method that only takes one such pass.
>
> You do not get the point. At the time you get the sorted references to the tuples, you need to fetch the tuples themselves, check their visibility, etc., and return them to the client.

As PFC correctly points out elsewhere in this thread, =maybe= you have to do all that. The vast majority of the time, people are not going to want to look at a detailed record-by-record output of that much data. The most common usage is to calculate or summarize some quality or quantity of the data and display that instead, or to use the tuples (or some quality of the tuples found) as an intermediate step in a longer query process such as a join. Sometimes there's a need to see _some_ of the detailed records: a random sample, or a region in a random part of the table, etc. It's rare that there is a RW need to actually list every record in a table of significant size.

On the rare occasions where one does have to return or display all records in such a large table, network IO and/or display IO speeds are the primary performance bottleneck. Not HD IO.

Nonetheless, if there _is_ such a need, there's nothing stopping us from rearranging the records in RAM into sorted order in one pass through RAM (using at most space for one extra record) after constructing the cache-conscious Btree index. Then the sorted records can be written to HD in RAM-buffer-sized chunks very efficiently. Repeating this process until we have stepped through the entire data set will take no more HD IO than one HD scan of the data, and it leaves us with a permanent result that can be reused for multiple purposes. If the sorted records are written in large enough chunks, rereading them at any later time can be done at maximum HD throughput.

In a total of two HD scans (one to read the original data, one to write out the sorted data) we can make a permanent rearrangement of the data. We've essentially created a cluster index version of the data.

> So, if there are only 2 values in the column of a big table that is larger than available RAM, two seq scans of the table without any sorting is the fastest solution.

If you only need to do this once, yes this wins. OTOH, if you have to do this sort even twice, my method is better.

regards, Ron
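As a concrete aside, Jean-Gerard's two-pass method for a two-valued key is trivial to write down; a self-contained toy version (illustrative only, nothing like the executor's real data structures):

    #include <stdio.h>

    /* Two sequential scans return the data "sorted" when the key
       column holds only two distinct values: emit the low value on
       pass 1 and the high value on pass 2. */
    int
    main(void)
    {
        static const int key[] = {1, 0, 0, 1, 0, 1, 1, 0};
        const size_t n = sizeof key / sizeof key[0];
        size_t i;

        for (i = 0; i < n; i++)         /* pass 1: sequential scan */
            if (key[i] == 0)
                printf("%d\n", key[i]);
        for (i = 0; i < n; i++)         /* pass 2: sequential scan */
            if (key[i] == 1)
                printf("%d\n", key[i]);
        return 0;
    }

The dispute above is exactly whether these two full scans can beat one scan plus a sort of compact keys.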
Re: [HACKERS] [PERFORM] A Better External Sort?
> From: Zeugswetter Andreas DAZ SD [EMAIL PROTECTED]
> Sent: Sep 29, 2005 9:28 AM
> Subject: RE: [HACKERS] [PERFORM] A Better External Sort?
>
> > In my original example, a sequential scan of the 1TB of 2KB or 4KB records, = 250M or 500M records of data, being sorted on a binary value key will take ~1000x more time than reading in the ~1GB Btree I described that used a Key+RID (plus node pointers) representation of the data.
>
> Imho you seem to ignore the final step your algorithm needs of collecting the data rows. After you sorted the keys, the collect step will effectively access the tuples in random order (given a sufficiently large key range).

Collecting the data rows can be done for each RAM buffer full of data in one pass through RAM after we've built the Btree. Then, if desired, those data rows can be read out to HD in sorted order in essentially one streaming burst. This combination of index build + RAM buffer rearrangement + write results to HD can be repeated as often as needed until we end up with an overall Btree index and a set of sorted sublists on HD. Overall HD IO for the process is only two effectively sequential passes through the data.

Subsequent retrieval of the sorted information from HD can be done at full HD streaming speed, and whatever we've decided to save to HD can be reused later if we desire.

Hope this helps, Ron
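To make the Key+RID idea concrete: the sort handles small fixed-size (key, row id) pairs, and only the gather step touches the wide tuples. A hedged sketch (struct layout and all names are mine, not anything in the PostgreSQL tree):

    #include <stdlib.h>

    /* Sort compact (key, RID) pairs; N entries touch ~16N bytes of
       memory instead of ~2048N bytes for full 2KB tuples. */
    typedef struct
    {
        unsigned long long key;     /* the sort key */
        unsigned long long rid;     /* where the full tuple lives */
    } KeyRid;

    static int
    cmp_keyrid(const void *a, const void *b)
    {
        const KeyRid *x = (const KeyRid *) a;
        const KeyRid *y = (const KeyRid *) b;

        if (x->key < y->key) return -1;
        if (x->key > y->key) return 1;
        return 0;
    }

    void
    sort_key_rids(KeyRid *entries, size_t n)
    {
        qsort(entries, n, sizeof(KeyRid), cmp_keyrid);
        /* ...then gather: fetch tuples by entries[i].rid a RAM
           buffer at a time and stream them out in sorted order. */
    }

Andreas's objection is about that final comment: with a wide key range, fetching by RID degenerates into random IO unless the gather is batched the way Ron describes.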
Re: [HACKERS] [PERFORM] A Better External Sort?
> From: Josh Berkus josh@agliodbs.com
> Sent: Sep 29, 2005 12:54 PM
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> The biggest single area where I see PostgreSQL external sort sucking is on index creation on large tables. For example, for the free version of TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's hardware, but over 3 hours to create each index on that table. This means that overall our load into TPCH takes 4 times as long to create the indexes as it did to bulk load the data.

Hmmm. 60GB/5400secs = 11MBps. That's ssllooww. So the first problem is evidently that our physical layout and/or HD IO layer sucks.

Creating the table and then creating the indexes on the table is going to require more physical IO than if we created the table and the indexes concurrently in chunks and then combined the indexes on the chunks into the overall indexes for the whole table, so there's a potential speed-up. The method I've been talking about is basically a recipe for creating indexes as fast as possible, with as few IO operations (HD or RAM) as possible and nearly no random ones, so it could help as well.

OTOH, HD IO rate is the fundamental performance metric. As long as our HD IO rate is pessimal, so will the performance of everything else be. Why can't we load a table at closer to the peak IO rate of the HDs?

> Anyone restoring a large database from pg_dump is in the same situation. Even worse, if you have to create a new index on a large table on a production database in use, because the I/O from the index creation swamps everything.

Fix for this in the works ;-)

> Following an index creation, we see that 95% of the time required is the external sort, which averages 2mb/s. Assuming decent HD HW, this is HORRIBLE.

What kind of instrumenting and profiling has been done of the code involved?

> This is with separate drives for the WAL, the pg_tmp, the table and the index. I've confirmed that increasing work_mem beyond a small minimum (around 128mb) had no benefit on the overall index creation speed.

No surprise. The process is severely limited by the abysmally slow HD IO.

Ron
Re: [HACKERS] postgresql clustering
Thanks for your reply, Luke. Bizgres looks like a very promising project; I'll be sure to follow it.

Thanks to everyone for their comments. I'm starting to understand the truth behind the hype, and where these performance gains and hits stem from.

-Dan
Re: [HACKERS] postgresql clustering
What about clustered filesystems? At first blush I would think the overhead of something like GFS might kill performance. Could one potentially achieve a fail-over config using multiple nodes with GFS, each having its own instance of PostgreSQL (but only one running at any given moment)?

Best, Dan
Re: [HACKERS] [PERFORM] A Better External Sort?
Just to add a little anarchy to your nice debate...

Who really needs all the results of a sort on your terabyte table? I guess not many people do a SELECT from such a table and want all the results. So, this leaves:
- Really wanting all the results, to fetch using a cursor,
- CLUSTER-type things, where you really want everything in order,
- Aggregates (Sort->GroupAggregate), which might really need to sort the whole table,
- Complex queries where the whole dataset needs to be examined in order to return a few values,
- Joins (again, the whole table is probably not going to be selected),
- And the ones I forgot.

However, most likely you only want to SELECT N rows, in some ordering:
- the first N (ORDER BY x LIMIT N)
- the last N (ORDER BY x DESC LIMIT N)
- WHERE x > value ORDER BY x LIMIT N
- WHERE x < value ORDER BY x DESC LIMIT N
- and other variants

Or you are doing a Merge JOIN against some other table; in that case, yes, you might need the whole sorted terabyte table, but most likely there are WHERE clauses in the query that restrict the set, and thus maybe we can get some conditions or limit values on the column to sort. Also the new, optimized hash join, which is more memory efficient, might cover this case.

Point is, sometimes you only need part of the results of your sort. And the bigger the sort, the more likely it becomes that you only want part of the results. So, while we're in the fun hand-waving, new-algorithm-trying mode, why not consider this right from the start? (I know I'm totally in hand-waving mode right now, so slap me if needed.)

I'd say your new, fancy sort algorithm needs a few more input values:
- Range of values that must appear in the final result of the sort: none, minimum, maximum, both, or even a set of values from the other side of the join, hashed, or sorted.
- LIMIT information (first N, last N, none)
- Enhanced LIMIT information (first/last N values of the second column to sort, for each value of the first column) (the infamous "top 10 by category" query)
- etc.

With this, the amount of data that needs to be kept in memory is dramatically reduced, from the whole table (even using your compressed keys, that's big) to something more manageable, closer to the size of the final result set that will be returned to the client, and it avoids a lot of effort. So this would not be useful in all cases, but when it applies, it would be really useful.

Regards!
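For what it's worth, the ORDER BY x LIMIT N cases above need no full sort at all: one sequential pass with a bounded max-heap of the N best keys seen so far is enough. A minimal sketch (not PostgreSQL code; all names are mine):

    #include <stdlib.h>

    typedef double Key;

    typedef struct
    {
        Key    *v;      /* heap array: v[0] is the largest kept key */
        size_t  n;      /* current size */
        size_t  cap;    /* N, the LIMIT */
    } TopN;

    static void
    swap_keys(Key *a, Key *b)
    {
        Key t = *a; *a = *b; *b = t;
    }

    static void
    sift_down(TopN *h, size_t i)
    {
        for (;;)
        {
            size_t l = 2 * i + 1, r = l + 1, big = i;

            if (l < h->n && h->v[l] > h->v[big]) big = l;
            if (r < h->n && h->v[r] > h->v[big]) big = r;
            if (big == i) return;
            swap_keys(&h->v[i], &h->v[big]);
            i = big;
        }
    }

    /* Offer one scanned key: kept only if it beats the current
       N-th smallest.  O(log N) per row, O(N) memory total. */
    void
    topn_offer(TopN *h, Key k)
    {
        if (h->n < h->cap)
        {
            size_t i = h->n++;          /* heap not full: sift up */

            h->v[i] = k;
            while (i > 0 && h->v[(i - 1) / 2] < h->v[i])
            {
                swap_keys(&h->v[i], &h->v[(i - 1) / 2]);
                i = (i - 1) / 2;
            }
        }
        else if (k < h->v[0])
        {
            h->v[0] = k;                /* replace current maximum */
            sift_down(h, 0);
        }
    }

Memory stays at N keys regardless of table size, which is exactly the reduction PFC is asking the fancy sort to exploit.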
Fwd: Re: [HACKERS] postgresql clustering
What is the relationship between database support for clustering and grid computing, and support for distributed databases? Two-phase COMMIT is coming in 8.1. What effect will this have in promoting FOSS grid support or distribution solutions for PostgreSQL?
Re: [HACKERS] Install pg_regress script to support PGXS?
On Thu, Sep 29, 2005 at 03:24:15PM +0200, Fabien COELHO wrote:
> > While testing the recent pgxs patch, I noticed that you can build and install contrib with PGXS: ... It seems that it ought to work to run installcheck too: gmake USE_PGXS=1 installcheck but this does not quite work because the pg_regress script isn't included in the installation tree. (If you copy it to where it'd need to be, installcheck works.) Is it worth including pg_regress in the installation to make this work? Seems like it might be handy for external modules to be able to run self-tests.
>
> The reason it was not included at first is that it seemed to me that self-tests require a temporary installation [...]

The temporary installation is required to run make check, but not to run make installcheck (which uses the regular installation instead). So just including pg_regress is enough, and since it's a small and useful program, I'd say it's a reasonable thing to do.

(Maybe we should audit it for security problems, just in case the distributors choose to include it in packages. The temp file usage for $TMPFILE looks predictable, thus maybe it _is_ insecure. Apparently it can be made to clobber any existing file owned by whoever runs it.)

-- Alvaro Herrera    http://www.PlanetPostgreSQL.org
"Entristecido, Wutra (canción de Las Barreras) echa a Freyr a rodar y a nosotros al mar"
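The $TMPFILE worry is the classic predictable-temp-name race. In C the fix is mkstemp(); for a shell script such as pg_regress, the analogue would be mktemp(1). A small illustration, assuming a POSIX system:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* A fixed name like /tmp/pg_regress.$$ can be guessed and
           pre-created (or symlinked) by another local user; mkstemp
           creates the file atomically, mode 0600, with an
           unpredictable suffix. */
        char name[] = "/tmp/pg_regress_XXXXXX";
        int fd = mkstemp(name);

        if (fd < 0)
        {
            perror("mkstemp");
            return 1;
        }
        printf("safe temp file: %s\n", name);
        close(fd);
        unlink(name);
        return 0;
    }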
Re: [HACKERS] Install pg_regress script to support PGXS?
Alvaro Herrera [EMAIL PROTECTED] writes:
> (Maybe we should audit it for security problems, just in case the distributors choose to include it in packages. The temp file usage for $TMPFILE looks predictable, thus maybe it _is_ insecure. Apparently it can be made to clobber any existing file owned by whoever runs it.)

pg_regress already *is* shipped in packages (certainly the RPMs contain it), so if you see any such problems, please do fix 'em.

regards, tom lane
Re: [HACKERS] Socket problem using beta2 on Windows-XP
Alvaro Herrera [EMAIL PROTECTED] writes:
> If it's two zombies per minute, then I bet it's the stat collector and stat bufferer. They are restarted by the postmaster if not found to be running.

That would make some sense, because the stat processes need to set up new sockets (for the pipe between them). The autovacuum theory didn't hold any water in my eyes because autovacuum doesn't create any new sockets.

However, why two zombies? That would mean that the grandchild process started, which should mean that the pipe was already created ...

Does Windows have any equivalent of strace whereby we could watch what's happening during stats process launch?

regards, tom lane
Re: [HACKERS] Socket problem using beta2 on Windows-XP
> > If it's two zombies per minute, then I bet it's the stat collector and stat bufferer. They are restarted by the postmaster if not found to be running.
>
> That would make some sense, because the stat processes need to set up new sockets (for the pipe between them). The autovacuum theory didn't hold any water in my eyes because autovacuum doesn't create any new sockets. However, why two zombies? That would mean that the grandchild process started, which should mean that the pipe was already created ...
>
> Does Windows have any equivalent of strace whereby we could watch what's happening during stats process launch?

First of all, I won't be able to dig into this any more until next week - sorry about that. But others are always free to :-)

There is no strace equivalent built in, but you can get an add-on from http://www.bindview.com/Services/RAZOR/Utilities/Windows/strace_readme.cfm. Don't put it on a production box permanently, though; it tends to cause BSODs in some cases.

//Magnus
Re: [HACKERS] Found small issue with OUT params
Tom Lane wrote:
> Mike Rylander [EMAIL PROTECTED] writes:
> > Using that logic, a function with one OUT param would be the same as a function returning a rowtype with only one column,
>
> But it's not (and no, I don't want to make it so, because the overhead for the useless record result would be significant).
>
> regards, tom lane

Tom,

I hardly think the overhead would be significant on modern processors; I don't think the majority of users are running on Pentium 90s. (I am assuming you mean a performance overhead.)

The whole point is that the current behavior is inconsistent and not expected, and should be changed to be in line with the way other DB systems work. What is the point of even allowing a single OUT param then? You might as well just raise an error and tell the user that a single OUT param is not allowed.

8.1 is going to bring even more users over from systems like Firebird, MS SQL and even Oracle, and all of these allow a single OUT param and return the name of the OUT param, not the name of the function. Like I said before, this behavior is going to make it more difficult to port applications from other systems. How difficult can it be to check if the function has a single OUT param, as compared to the old way of using RETURN?

Sorry if I am being a pain in the you-know-what, but I really think I am correct on this matter.

Thanks, Tony
Re: [HACKERS] Found small issue with OUT params
On Fri, Sep 30, 2005 at 10:20:34AM -0500, Tony Caduto wrote:
> Tom,
> I hardly think the overhead would be significant on modern processors; I don't think the majority of users are running on Pentium 90s. (I am assuming you mean a performance overhead.)

Um, please read the documentation. Returning a tuple is *significantly* more expensive than returning a single value. You have to get the tuple descriptor, allocate memory for the tuple, fill in all the fields with your data... For a single value you just return it. See here for all the details; you really don't want to do it if you don't need to.

http://www.postgresql.org/docs/8.0/interactive/xfunc-c.html#AEN30497

Now, you could fudge the parser to automatically alter the name of the value in the function, but I have no idea how hard that would be...

-- Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them.
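To illustrate the difference Martijn describes, here is roughly what the two paths look like in a C-language function, following the composite-return recipe on the page he links (function names are hypothetical; the record path is sketched from the 8.0-era funcapi.h interface):

    #include "postgres.h"
    #include "fmgr.h"
    #include "funcapi.h"

    PG_FUNCTION_INFO_V1(answer_scalar);

    /* Scalar path: just hand back a Datum. */
    Datum
    answer_scalar(PG_FUNCTION_ARGS)
    {
        PG_RETURN_INT32(42);
    }

    PG_FUNCTION_INFO_V1(answer_record);

    /* Record path: descriptor lookup plus tuple construction,
       on every single call. */
    Datum
    answer_record(PG_FUNCTION_ARGS)
    {
        TupleDesc      tupdesc;
        AttInMetadata *attinmeta;
        char          *values[1];
        HeapTuple      tuple;

        /* Find out what record type the caller expects. */
        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "function returning record called in wrong context");

        attinmeta = TupleDescGetAttInMetadata(tupdesc);

        values[0] = "42";       /* field values as C strings */
        tuple = BuildTupleFromCStrings(attinmeta, values);

        PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
    }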
Re: [HACKERS] Socket problem using beta2 on Windows-XP
Tom Lane wrote:
> However, why two zombies? That would mean that the grandchild process started, which should mean that the pipe was already created ...

To clarify: I'm talking about the TCPView window and connections, and thus zombie connections. They both belong to the same pid and seem to point to each other. The actual process no longer exists (it can't be viewed anywhere).

Regards, Thomas Hallgren
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron,

> Hmmm. 60GB/5400secs = 11MBps. That's ssllooww. So the first problem is evidently that our physical layout and/or HD IO layer sucks.

Actually, it's much worse than that, because the sort is only dealing with one column. As I said, monitoring with iostat, our top speed was 2.2MB/s.

--Josh
Re: [HACKERS] Found small issue with OUT params
Martijn van Oosterhout wrote:
> On Fri, Sep 30, 2005 at 10:20:34AM -0500, Tony Caduto wrote:
> > Tom,
> > I hardly think the overhead would be significant on modern processors; I don't think the majority of users are running on Pentium 90s. (I am assuming you mean a performance overhead.)
>
> Um, please read the documentation. Returning a tuple is *significantly* more expensive than returning a single value. You have to get the tuple descriptor, allocate memory for the tuple, fill in all the fields with your data... For a single value you just return it. See here for all the details; you really don't want to do it if you don't need to.
>
> http://www.postgresql.org/docs/8.0/interactive/xfunc-c.html#AEN30497
>
> Now, you could fudge the parser to automatically alter the name of the value in the function, but I have no idea how hard that would be...

So you might notice a little performance hit bringing back a million rows, but most of these single-OUT-param functions only return one row/value anyway. There would be zero perceivable difference in performance for a single value/row, regardless of the extra overhead.

As an application developer, I don't care about tuples etc.; I just want it to work as expected, without having to resort to hacks like creating a second OUT param that is not used. Otherwise I would have to change a lot of client code wherever the OUT param is referenced by name instead of position, and that is done a lot because the position is more likely to change than the name.

The bottom line (regardless of any overhead, or whether I read the docs about returning a tuple) is that if you have an OUT param, it should return that name, not the name of the function, period.

Thanks, Tony
Re: [HACKERS] [PERFORM] A Better External Sort?
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-[EMAIL PROTECTED]] On Behalf Of PFC
> Sent: Thursday, September 29, 2005 9:10 AM
> To: [EMAIL PROTECTED]
> Cc: Pg Hackers; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> Just to add a little anarchy to your nice debate...
> Who really needs all the results of a sort on your terabyte table?

Reports with ORDER BY/GROUP BY, and many other possibilities. 40% of mainframe CPU cycles are spent sorting. That is because huge volumes of data require lots of energy to be meaningfully categorized. Let's suppose that instead of a terabyte of data (or a petabyte or whatever) we have 10% of it. That's still a lot of data.

> I guess not many people do a SELECT from such a table and want all the results.

What happens when they do? The cases where it is already fast are not very important. The cases where things go into the crapper are the ones that need attention.

> So, this leaves:
> - Really wanting all the results, to fetch using a cursor,
> - CLUSTER-type things, where you really want everything in order,
> - Aggregates (Sort->GroupAggregate), which might really need to sort the whole table,
> - Complex queries where the whole dataset needs to be examined in order to return a few values,
> - Joins (again, the whole table is probably not going to be selected),
> - And the ones I forgot.
>
> However, most likely you only want to SELECT N rows, in some ordering:
> - the first N (ORDER BY x LIMIT N)
> - the last N (ORDER BY x DESC LIMIT N)

For these, the QuickSelect algorithm is what is wanted. For example:

    #include <stdlib.h>

    typedef double Etype;

    extern Etype  RandomSelect(Etype * A, size_t p, size_t r, size_t i);
    extern size_t RandRange(size_t a, size_t b);
    extern size_t RandomPartition(Etype * A, size_t p, size_t r);
    extern size_t Partition(Etype * A, size_t p, size_t r);

    /*
    ** In the following code, every reference to CLR means:
    **
    **    "Introduction to Algorithms"
    **    By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
    **    ISBN 0-07-013143-0
    */

    /*
    ** CLR, page 187
    */
    Etype
    RandomSelect(Etype A[], size_t p, size_t r, size_t i)
    {
        size_t q, k;

        if (p == r)
            return A[p];
        q = RandomPartition(A, p, r);
        k = q - p + 1;

        if (i <= k)
            return RandomSelect(A, p, q, i);
        else
            return RandomSelect(A, q + 1, r, i - k);
    }

    size_t
    RandRange(size_t a, size_t b)
    {
        size_t c = (size_t) ((double) rand() / ((double) RAND_MAX + 1) * (b - a));

        return c + a;
    }

    /*
    ** CLR, page 162
    */
    size_t
    RandomPartition(Etype A[], size_t p, size_t r)
    {
        size_t i = RandRange(p, r);
        Etype  Temp;

        Temp = A[p];
        A[p] = A[i];
        A[i] = Temp;
        return Partition(A, p, r);
    }

    /*
    ** CLR, page 154
    */
    size_t
    Partition(Etype A[], size_t p, size_t r)
    {
        Etype  x, temp;
        size_t i, j;

        x = A[p];
        i = p - 1;
        j = r + 1;

        for (;;) {
            do {
                j--;
            } while (!(A[j] <= x));
            do {
                i++;
            } while (!(A[i] >= x));
            if (i < j) {
                temp = A[i];
                A[i] = A[j];
                A[j] = temp;
            }
            else
                return j;
        }
    }

(A small test driver for these routines appears at the end of this message.)

> - WHERE x > value ORDER BY x LIMIT N
> - WHERE x < value ORDER BY x DESC LIMIT N
> - and other variants
>
> Or you are doing a Merge JOIN against some other table; in that case, yes, you might need the whole sorted terabyte table, but most likely there are WHERE clauses in the query that restrict the set, and thus maybe we can get some conditions or limit values on the column to sort.

Where clause filters are to be applied AFTER the join operations, according to the SQL standard.

> Also the new, optimized hash join, which is more memory efficient, might cover this case.

For == joins. Not every order by is applied to joins. And not every join is an equal join.
> Point is, sometimes you only need part of the results of your sort. And the bigger the sort, the more likely it becomes that you only want part of the results.

That is an assumption that will sometimes be true, and sometimes not. It is not possible to predict usage patterns for a general-purpose database system.

> So, while we're in the fun hand-waving, new-algorithm-trying mode, why not consider this right from the start? (I know I'm totally in hand-waving mode right now, so slap me if needed.)
>
> I'd say your new, fancy sort algorithm needs a few more input values:
> - Range of values that must appear in the final result of the sort: none, minimum, maximum, both, or even a set of values from the other side of the join, hashed, or sorted.

That will already happen (or it certainly
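For completeness, a small driver for the QuickSelect routines quoted above (illustrative only; it assumes those routines are in scope, and note that CLR's i is 1-based within the subarray):

    #include <stdio.h>

    int
    main(void)
    {
        Etype  a[] = {9.0, 1.0, 7.0, 3.0, 5.0, 8.0, 2.0, 6.0, 4.0};
        size_t n = sizeof a / sizeof a[0];

        /* The 5th smallest of 9 values is the median: found in
           expected O(n) time, with no full sort performed. */
        printf("median = %g\n", RandomSelect(a, 0, n - 1, (n + 1) / 2));
        return 0;
    }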
Re: [HACKERS] Found small issue with OUT params
> So you might notice a little performance hit bringing back a million rows, but most of these single-OUT-param functions only return one row/value anyway. There would be zero perceivable difference in performance for a single value/row, regardless of the extra overhead.

Sounds like we need a test case... up for it?

Sincerely, Joshua D. Drake

-- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Re: [HACKERS] Found small issue with OUT params
On Friday 30 September 2005 11:49, Martijn van Oosterhout wrote:
> On Fri, Sep 30, 2005 at 10:20:34AM -0500, Tony Caduto wrote:
> > Tom,
> > I hardly think the overhead would be significant on modern processors; I don't think the majority of users are running on Pentium 90s. (I am assuming you mean a performance overhead.)
>
> Um, please read the documentation. Returning a tuple is *significantly* more expensive than returning a single value. You have to get the tuple descriptor, allocate memory for the tuple, fill in all the fields with your data... For a single value you just return it.

ISTM it is better for us to be consistent with the visible behavior than to have two different behaviors for OUT-param functions just so one can be faster. That way, if people are concerned about the speed difference, they can rewrite the function without an OUT param... AFAICT, as it stands now, you've given them no choice and are forcing them to handle two different scenarios.

-- Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron,

> That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations.

Oh, yeah. Well, that's separate from sort. See multiple posts on this list from the Greenplum team, the COPY patch for 8.1, etc. We've been concerned about I/O for a while.

Realistically, you can't do better than about 25MB/s on single-threaded I/O on current Linux machines, because your bottleneck isn't the actual disk I/O. It's CPU. Databases which go faster than this are all, to my knowledge, using multi-threaded disk I/O. (And I'd be thrilled to get a consistent 25MB/s on PostgreSQL, but that's another thread ...)

> As I said, the obvious candidates are inefficient physical layout and/or flawed IO code.

Yeah, that's what I thought too. But try sorting a 10GB table, and you'll see: disk I/O is practically idle, while CPU averages 90%+. We're CPU-bound, because sort is being really inefficient about something. I just don't know what yet.

If we move that CPU-bound behavior to a higher level of performance, then we can start looking at things like async I/O, O_DIRECT, pre-allocation, etc. that will give us incremental improvements. But what we need now is a 5-10x improvement, and that's somewhere in the algorithms or the code.

--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Re: [HACKERS] [PERFORM] A Better External Sort?
Ron,

On 9/30/05 1:20 PM, Ron Peacetree [EMAIL PROTECTED] wrote:
> That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations.

Bulk loading speed is irrelevant here - that is dominated by parsing, which we have covered copiously (har har) previously and have sped up by 500%, which still makes Postgres 1/2 the loading speed of MySQL.

> As I said, the obvious candidates are inefficient physical layout and/or flawed IO code.

Yes. Until the basic IO issues are addressed, we could replace the present sorting code with infinitely fast sorting code and we'd still be scrod performance-wise.

Postgres' I/O path has many problems that must be micro-optimized away. Too small an operand size compared to disk caches, memory, etc. is the common problem. Another is lack of micro-parallelism (loops) with runs long enough to let modern processors pipeline and run superscalar.

The net problem here is that a simple "select blah from blee order by (blah.a);" runs at 1/100 of the sequential scan rate.

- Luke
Re: [HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed from OK to Contrib failure
On Fri, Sep 30, 2005 at 10:08:11AM +0100, Dave Page wrote:
> Looks like there's more unportable code in the recent changes to pgbench :-(

Here, the culprits are tfind() and tsearch(). These apparently aren't portable enough, though they seem to exist on all other platforms. Maybe we could come up with a replacement on Windows? Are there simple btree/hash table functions on Windows, with a similar API?

-- Alvaro Herrera    Architect, http://www.EnterpriseDB.com
"La web junta la gente porque no importa que clase de mutante sexual seas, tienes millones de posibles parejas. Pon 'buscar gente que tengan sexo con ciervos incendiándose', y el computador dirá 'especifique el tipo de ciervo'" (Jason Alexander)
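For the record, a drop-in substitute is not much code if an unbalanced binary search tree is acceptable; a hedged sketch of tsearch()/tfind() work-alikes (purely illustrative, not the fix actually adopted for pgbench):

    #include <stdlib.h>

    typedef struct tnode
    {
        const void   *key;
        struct tnode *left;
        struct tnode *right;
    } tnode;

    /* Insert key if absent; return the node holding it (like tsearch). */
    void *
    my_tsearch(const void *key, void **rootp,
               int (*compar)(const void *, const void *))
    {
        tnode **slot = (tnode **) rootp;

        while (*slot != NULL)
        {
            int c = compar(key, (*slot)->key);

            if (c == 0)
                return *slot;               /* already present */
            slot = (c < 0) ? &(*slot)->left : &(*slot)->right;
        }
        *slot = (tnode *) malloc(sizeof(tnode));
        if (*slot == NULL)
            return NULL;
        (*slot)->key = key;
        (*slot)->left = (*slot)->right = NULL;
        return *slot;
    }

    /* Look up key; return its node, or NULL if absent (like tfind). */
    void *
    my_tfind(const void *key, void **rootp,
             int (*compar)(const void *, const void *))
    {
        tnode *node = rootp ? *(tnode **) rootp : NULL;

        while (node != NULL)
        {
            int c = compar(key, node->key);

            if (c == 0)
                return node;
            node = (c < 0) ? node->left : node->right;
        }
        return NULL;
    }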
Re: [HACKERS] postgresql clustering
Dan,

On 9/29/05 3:23 PM, Daniel Duvall [EMAIL PROTECTED] wrote:
> What about clustered filesystems? At first blush I would think the overhead of something like GFS might kill performance. Could one potentially achieve a fail-over config using multiple nodes with GFS, each having its own instance of PostgreSQL (but only one running at any given moment)?

Interestingly, my friend Matt O'Keefe built GFS at UMN; I was one of his first customers/sponsors of the research in 1998, when I implemented an 8-node shared-disk cluster on Alpha Linux using GFS and Fibre Channel.

Again, it depends on what you're doing. If it's OLTP, you will spend too much time in lock management for disk access, and things like Oracle RAC's Cache Fusion become critical to reduce the number of times you have to hit the disks. For warehousing/sequential scans, this kind of clustering is irrelevant.

- Luke
Re: [HACKERS] [PERFORM] A Better External Sort?
I see the following routines that seem to be related to sorting. If I were to examine these routines to consider ways to improve sorting, which should I key in on? I am guessing that tuplesort.c is the hub of activity for database sorting.

Directory of U:\postgresql-snapshot\src\backend\access\nbtree
    08/11/2005  06:22 AM    24,968  nbtsort.c

Directory of U:\postgresql-snapshot\src\backend\executor
    03/16/2005  01:38 PM     7,418  nodeSort.c

Directory of U:\postgresql-snapshot\src\backend\utils\sort
    09/23/2005  08:36 AM    67,585  tuplesort.c

Directory of U:\postgresql-snapshot\src\bin\pg_dump
    06/29/2005  08:03 PM    31,620  pg_dump_sort.c

Directory of U:\postgresql-snapshot\src\port
    07/27/2005  09:03 PM     5,077  qsort.c
Re: [HACKERS] postgresql clustering
Luke Lonergan wrote:
> Again, it depends on what you're doing. If it's OLTP, you will spend too much time in lock management for disk access, and things like Oracle RAC's Cache Fusion become critical to reduce the number of times you have to hit the disks.

Hitting the disk is really bad. However, we have seen that consulting the network for small portions of data (e.g. locks) is even more critical: you will see the CPU on all nodes running at 1% or so while the network is waiting for data to be exchanged (latency) - this is the real problem. I don't know what Oracle is doing in detail, but they have a real problem when losing a node inside the cluster (syncing again is really time-consuming).

> For warehousing/sequential scans, this kind of clustering is irrelevant.

I suggest looking at Teradata - they do really nice query partitioning on so-called AMPs (we'd simply call them nodes). It is really nice for really ugly warehousing queries (ugly in terms of amount of data).

Hans

-- Cybertec Geschwinde & Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at
Re: [HACKERS] \d on database with a lot of tables is slow
On Wed, Sep 28, 2005 at 11:35:31AM -0400, Tom Lane wrote:
> Jim C. Nasby [EMAIL PROTECTED] writes:
> > Sorry, remembered it wrong. It's 'Did not find any relation named', which appears to be in bin/psql/describe.c. It does occur when trying to do a \d on a specific table.
>
> Hmm, no immediate ideas. You haven't provided a lot of context about this --- when it happens, is it repeatable? Are they giving an exact table name or a pattern to \d? Is a schema name included in what they give to \d? What PG version are they running exactly?

Sorry, I had the error message wrong:

    ERROR: cache lookup failed for relation 1906465919

It is on an exact table name. When we retry the describe after a failure, sometimes it works and sometimes it fails again; when it fails again, the relation number is different. Also, \d schema.table always returns quickly and never errors; \d table is slow and produces the error fairly often. They're using 8.0.x; I'm pretty certain it's 8.0.3.

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
[HACKERS] Bug 1473, pthread python on FreeBSD
http://archives.postgresql.org/pgsql-bugs/2005-02/msg00135.php

I think it may have been a bit early to disable pthread Python support (http://archives.postgresql.org/pgsql-hackers/2005-09/msg01136.php), as Python was working fine on buildfarm member platypus. Maybe it's only an issue with 4.x machines?

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
Re: [HACKERS] Vacuum questions...
On Thu, Sep 29, 2005 at 12:50:13AM +0300, Hannu Krosing wrote:
> On T, 2005-09-27 at 17:57 -0500, Jim C. Nasby wrote:
> > On Tue, Sep 27, 2005 at 02:47:46PM -0400, Jan Wieck wrote:
> > > On 9/24/2005 8:17 PM, Jim C. Nasby wrote:
> > > > Would it be difficult to vacuum as part of a dump? The reasoning behind this is that you have to read the table to do the dump anyway,
> > >
> > > I think aside from what's been said so far, it would be rather difficult anyway. pg_dump relies on MVCC and requires to run in one transaction to see a consistent snapshot, while vacuum jiggles around with transactions in some rather non-standard way.
> >
> > Is this true even if they were in different connections? My (vague) understanding of the vacuum process is that it first vacuums indexes, and then vacuums the heap.
>
> Actually, (lazy) vacuum does this:
> 1) scan heap, collect ctids of rows to remove
> 2) clean indexes
> 3) clean heap
>
> > Since we don't dump indexes, there's nothing for backup to do while those are vacuumed, so my idea is:
> >
> > pg_dump:
> >   foreach (table)
> >     spawn vacuum
> >     wait for vacuum to hit heap
> >     start copy
> >     wait for analyze to finish
> >   next;
>
> Probably the first heap scan of vacuum would go faster than dump, as it does not have to write out anything, and the second scan (nr 3 in the above list) would be either faster or slower, as it has to lock each page and rearrange tuples there. So it would be very hard to synchronize vacuum with either of them.

Well, I guess it depends on what the dump was writing to. Also depends on available cache, I expect.

Is this something that could be hacked together fairly easily, just for testing purposes? Would firing off a VACUUM tablename at the same time as a COPY tablename be a good enough approximation?

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
Re: [HACKERS] roundoff problem in time datatype
On Mon, Sep 26, 2005 at 06:23:06PM +0200, Andreas Pflug wrote:
> Tom Lane wrote:
> > Dennis Bjorklund [EMAIL PROTECTED] writes:
> > > Does the SQL standard say anything on the matter?
> >
> > It doesn't seem very helpful. AFAICS, we should interpret storing '23:59:59.99' into a TIME(0) field as a cast from TIME(2) to TIME(0), and the spec defines that as:
> >
> >     15) If TD is the datetime data type TIME WITHOUT TIME ZONE, then let TSP be the time precision of TD.
> >     b) If SD is TIME WITHOUT TIME ZONE, then TV is SV, with implementation-defined rounding or truncation if necessary.
> >
> > So it's implementation-defined what we do.
>
> IMHO, since 23:59:59.99 probably means "the last milliseconds of this day, as far as precision allows expressing it", this should be truncated to 23:59:59, not rounded to 24:00:00. Until the last microsecond has elapsed, it's not 24 hours (you wouldn't round "happy new year" at 23:59:30 from a clock with minutes only, either).

Maybe also allow for a warning to be generated? Or some way to signal an overflow? I think it could be valid to do this, or round up to 24:00:00, or 'round up' to 00:00:00, depending on what the app was trying to accomplish. Would it be possible to allow an option to the datatype that specifies the rounding behavior, or would they need to be different datatypes?

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
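To put numbers on rounding versus truncation: at precision 0, 23:59:59.99 is 86399.99 seconds since midnight, and rounding crosses into the next day while truncation does not. A trivial standalone illustration:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double t = 23 * 3600 + 59 * 60 + 59.99;  /* 86399.99 s */

        /* Rounding to whole seconds overflows the day... */
        printf("rounded:   %.0f s  (86400 = '24:00:00')\n", floor(t + 0.5));
        /* ...while truncation stays inside it. */
        printf("truncated: %.0f s  (86399 = '23:59:59')\n", floor(t));
        return 0;
    }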
Re: [HACKERS] roundoff problem in time datatype
On Mon, Sep 26, 2005 at 11:46:47AM -0400, Tom Lane wrote:
> Jochem van Dieten [EMAIL PROTECTED] writes:
> > On 9/26/05, Dennis Bjorklund wrote:
> > > One reason is because it's what the standard demands.
> >
> > Could you cite that? The only thing I can find in the SQL standard is that the hour field in an INTERVAL can not exceed 23, not datetimes.
>
> SQL99 has Table 11, "Valid values for datetime fields":
>
>     Keyword            Valid values of datetime fields
>     ---------------    ------------------------------------------------
>     YEAR               0001 to 9999
>     MONTH              01 to 12
>     DAY                Within the range 1 (one) to 31, but further
>                        constrained by the value of MONTH and YEAR
>                        fields, according to the rules for well-formed
>                        dates in the Gregorian calendar.
>     HOUR               00 to 23
>     MINUTE             00 to 59
>     SECOND             00 to 61.9(N), where 9(N) indicates the number
>                        of digits specified by time fractional seconds
>                        precision.
>     TIMEZONE_HOUR      -12 to 13
>     TIMEZONE_MINUTE    -59 to 59
>
>     NOTE 62 - Datetime data types will allow dates in the Gregorian
>     format to be stored in the date range 0001-01-01 CE through
>     9999-12-31 CE. The range for SECOND allows for as many as two
>     leap seconds. Interval arithmetic that involves leap seconds or
>     discontinuities in calendars will produce implementation-defined
>     results.
>
> The urban legend about needing 2 leap seconds in the same minute has infected the standard, I see. It should only allow 60.9(N) as the max value for SECOND.
>
> Note however that we feel free to exceed the spec in other aspects of this --- we exceed their year range, for instance. So I don't think we necessarily have to reject '24:00:00'.
>
> Also, the spec explicitly states that arithmetic on TIME values is done modulo 24 hours. So it's correct for '23:59:59'::time + '1 second'::interval to yield '00:00:00', but this does not necessarily mean that we should cause rounding to behave that way. Depends whether you think that rounding is an arithmetic operation or not ...

Does that portion of the spec also apply to plain time fields? The entire issue here only exists because there's no method to handle the overflow, unlike in a timestamp.

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
Re: [HACKERS] FW: PGBuildfarm member snake Branch HEAD Status changed
Alvaro Herrera wrote:
> On Fri, Sep 30, 2005 at 10:08:11AM +0100, Dave Page wrote:
> > Looks like there's more unportable code in the recent changes to pgbench :-(
>
> Here, the culprits are tfind() and tsearch(). These apparently aren't portable enough, though they seem to exist on all other platforms. Maybe we could come up with a replacement on Windows? Are there simple btree/hash table functions on Windows, with a similar API?

Not that I can see, looking here:
http://msdn.microsoft.com/library/en-us/vclib/html/vcrefRunTimeLibraryReference.asp

The library found at http://sourceforge.net/project/shownotes.php?release_id=209006 seems to have what is needed, but I really don't think we can impose that extra requirement at this stage of the release cycle, do you? ISTM we either need to revert this or change it to use an API that is already known to be supported.

cheers, andrew
Re: [HACKERS] Open items list for 8.1
On Wed, Sep 28, 2005 at 06:07:02PM -0400, Neil Conway wrote:
> On Wed, 2005-28-09 at 18:35 -0300, Marc G. Fournier wrote:
> > The problem isn't whether or not they should be changed; the problem is that they were changed *during* beta AND *against* the direction that discussion on these changes went
>
> I'm not sure what you mean: what is the direction that discussion on these changes went? (If you're referring to complete vs. total, that hardly constitutes a change in direction.)
>
> > ... pre-beta would have been more acceptable, but pre-feature-freeze would have been much preferred
>
> I think there is an argument to be made for reverting pg_cancel_backend, since that function was released with 8.0. Personally I'm sceptical that there are very many people using that function in scripts (particularly using it in such a way that their scripts will break if the return type is changed). Since we've already made the change, I don't really see the point in reverting it, but I don't mind if someone wants to do it.

I think it's just as important to work towards keeping interfaces clean as it is not to break old code. What's wrong with adding pg_cancel_backend(...) RETURNS int as an alias for the one that returns boolean, and documenting that it's deprecated and will be removed in the future? The same goes for Tom's timeofday() RETURNS text example.

> As for the other changes, I think there is absolutely no reason to revert them. Since when is making changes to the signatures of new functions forbidden during the beta period? AFAIK we don't make guarantees of backward compatibility during the beta period, nor would it be sensible to do so. We had the opportunity to fix some poor API choices, and since an initdb was already required, I think making these changes for beta2 was quite reasonable.

Agreed. Not making API changes now means we get to live with them for years and years.

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
Re: [HACKERS] On Logging
On Mon, Sep 26, 2005 at 10:57:54AM -0700, Ron Mayer wrote:
> David Fetter wrote:
> > ...log file formats in 8.0:
> > * CSV
> > * YAML
> > * XML
> > * Piped logs, as Apache can do
> > * DB handle. I know this one will be controversial. [...]
>
> 1. Am I the only one who wants an option for machine-readable logs? I'd very much like a format that can be easily loaded into a database (not necessarily the same one producing the logs :-) ) in real time and/or be visible as a table through something like dbi-link. I suppose any of the first three formats you suggest could work with dbi-link; another alternative format, SQL INSERT statements, would work if piped logs were supported, by sending them to psql.

Apache seems to have the best, most flexible logging of anything out there, and should probably be used as a model. It's pretty easy to have it actually log to a database.

Whatever method we decide on, I think it would be very useful if we supported multiple logging streams. I certainly wouldn't want to give up a human-readable log to get a CSV one.

Is a logging mechanism the best way to do profiling? Seems like it might be better to have a more efficient, dedicated method. But I'm not against adding capabilities like per-backend logging, etc.

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
Re: [HACKERS] Open items list for 8.1
On Fri, 2005-30-09 at 17:47 -0500, Jim C. Nasby wrote:
> What's wrong with adding pg_cancel_backend(...) RETURNS int as an alias for the one that returns boolean, and documenting that it's deprecated and will be removed in the future?

You can't overload functions based on their return type alone.

-Neil
Re: [HACKERS] Open items list for 8.1
Jim C. Nasby wrote:
> What's wrong with adding pg_cancel_backend(...) RETURNS int as an alias for the one that returns boolean, and documenting that it's deprecated and will be removed in the future? The same goes for Tom's timeofday() RETURNS text example.

We don't have the ability to have two functions that take the same parameters and return different results, because there is no facility to decide which function to call based on what return value is expected; a simple query doesn't have a return-value restriction:

    SELECT pg_cancel_backend();

-- Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: [HACKERS] State of support for back PG branches
On Mon, Sep 26, 2005 at 09:07:45PM -0700, Joshua D. Drake wrote:
> > A nice pg_upgrade utility would make a big difference. Clearly an in-place upgrade is possible, but maintaining it is hard. There are two broad ways of running a pg_upgrade project - one that is entirely independent of the main codebase, and one that puts requirements on the main codebase developers (if you change $foo, you provide code to translate old $foo to new $foo). Any feel for the relative difficulty of the two approaches? And how much push-back there'd be on the latter?
>
> You can do in-place upgrades with Slony-I and Mammoth Replicator.

With a lot more effort than a dump/restore, or presumably a pg_upgrade. I'd love to see a project that uses Slony to make an in-place migration as easy as possible. Maybe I'll get around to it in another 5 years.

-- Jim C. Nasby, Sr. Engineering Consultant   [EMAIL PROTECTED]
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
Re: [HACKERS] [PERFORM] A Better External Sort?
> Bulk loading speed is irrelevant here - that is dominated by parsing, which we have covered copiously (har har) previously and have sped up by 500%, which still makes Postgres 1/2 the loading speed of MySQL.

Let's ask MySQL 4.0:

    > LOAD DATA INFILE blah
    0 errors, 666 warnings
    > SHOW WARNINGS;
    not implemented. upgrade to 4.1

duh
Re: [HACKERS] [PERFORM] A Better External Sort?
That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Until the basic IO issues are addressed, we could replace the present sorting code with infinitely fast sorting code and we'd still be scrod performance-wise. So why does basic IO suck so badly? Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Sep 30, 2005 1:23 PM To: Ron Peacetree [EMAIL PROTECTED] Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Ron, Hmmm. 60GB/5400 secs = 11MBps. That's ssllooww. So the first problem is evidently our physical layout and/or HD IO layer sucks. Actually, it's much worse than that, because the sort is only dealing with one column. As I said, monitoring with iostat, our top speed was 2.2MB/s. --Josh ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Request for a force interactive mode flag (-I) for psql
Back in 2003 Bruce Momjian proposed adding a flag (-I) to psql to force it into interactive mode. (See http://archives.postgresql.org/pgsql-hackers/2003-11/msg00013.php for the thread.) The proposal was rejected because there was no proven need for it at that time. I'd like to raise this proposal again, since I think in our situation, this is the only fix for our problem. Our environment is as follows: * A large number of PostgreSQL 7.4 servers running on a variety of SuSE Linux (9.0 - 9.3) servers * A growing number of PostgreSQL 8.0.3 servers running on Windows Server 2003 servers. (We are running the native Win32 version of PostgreSQL, not the Cygwin version.) The servers are all located in remote offices. Maintenance is done remotely via SSH to a local bash command prompt where we use the command line tools: psql, pg_dump, etc. On Linux we use the native sshd; on Windows we use Cygwin to get bash, sshd, cron, etc. On Linux this works fine; on Windows, however, psql thinks it is not in an interactive console so we get no prompts, no line editing, no history, very little cursor control, etc. (I see that the missing autocomplete feature is a readline issue, but that's a topic for another posting.) This makes remote maintenance on the Windows servers much more difficult than it otherwise could be. The issue appears to be due to isatty() returning false in the Windows Cygwin environment. From other research around the web, it appears that if the app is Cygwin-aware is knows to override this check (or allow manual override) or try to do further testing, but in this case, since we are using the native Win32 version of PostgreSQL, psql doesn't do any additional testing. Also, even in the local console on Windows, running rxvt -- our preferred terminal in Windows (since it lets us make our Windows command line act just like our Linux command line grin) -- causes psql to think that there is no terminal. (Our first encounter of this no terminal problem was in trying to run psql via a local bash shell via rxvt, and originally we thought that psql was hanging. Given our reliance on psql for remote maintenance, this would have prevented our rolling out a Windows version of PostgreSQL. It was only after too much time looking at it with some low-level tools that we stumbled across the fact that psql was simply silently waiting at a command prompt rather than being hung. However, from other posts in these lists [e.g.: http://archives.postgresql.org/pgsql-patches/2004-07/msg00369.php ] it appears that other people also thought psql was hanging when it was run from a terminal program, so I suppose I shouldn't feel too bad...) Bruce's proposal and suggested code simply added a new -I flag to force psql into interactive mode (by simply setting pset.notty = 0). From everything I can find (including reading through the Cygwin code, ssh and sshd man pages and code, psql code, testing various modes of running the sshd service in Windows, changing Windows profile account permissions, etc.), adding this flag seems to be the only viable option (other than writing a replacement for psql for our Windows servers, something I'd prefer to not do). While I can obviously add this patch myself and build and maintain a custom version of psql for our own use, since there now is a true native version of PostgreSQL for Windows, I think we will see a growing list of people supporting mixed environments just like ours, and bumping into this exact same problem. Thoughts? - Bill Bill Bartlett meridianEMR, Inc. 
http://www.meridianemr.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PERFORM] A Better External Sort?
I have seen similar performance to what Josh reports, and my reasoning is as follows:

* WAL is the biggest bottleneck, with its default segment size of 16MB. Many people hate to recompile the code to change that default, and increasing checkpoint_segments helps, but there is still a lot of overhead in the rotation of WAL files (even putting WAL on tmpfs shows that it is still slow). Having an option for a bigger size would help to a small extent percentage-wise (and free up CPU a bit by doing fewer file rotations).

* Growing files: this is OS-dependent, but we do spend a lot of time doing small 8K block increases to grow files. If we could signal bigger chunks to grow by, or pre-grow data files to their expected size, that would help a lot in such cases.

* The COPY command had restrictions, but those have been fixed to a large extent. (Great job!)

But of course I have lost touch with programming and can't begin to understand the PostgreSQL code well enough to change it myself. Regards, Jignesh

Ron Peacetree wrote: That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Until the basic IO issues are addressed, we could replace the present sorting code with infinitely fast sorting code and we'd still be scrod performance-wise. So why does basic IO suck so badly? Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Sep 30, 2005 1:23 PM To: Ron Peacetree [EMAIL PROTECTED] Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Ron, Hmmm. 60GB/5400 secs = 11MBps. That's ssllooww. So the first problem is evidently our physical layout and/or HD IO layer sucks. Actually, it's much worse than that, because the sort is only dealing with one column. As I said, monitoring with iostat, our top speed was 2.2MB/s. --Josh ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
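For reference, a hedged postgresql.conf sketch of the tunables Jignesh alludes to that do not require a recompile; the values are illustrative, not recommendations (the 16MB WAL segment size itself is a compile-time constant in this era):

    # postgresql.conf
    checkpoint_segments = 64   # default 3; checkpoints happen far less often during a bulk load
    wal_buffers = 256          # default 8 (in 8kB pages); helps bursty bulk writes

Neither setting changes the per-segment file size; spacing checkpoints further apart simply reduces how often WAL segments are created and recycled while the load runs.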
Re: [HACKERS] State of support for back PG branches
On Mon, Sep 26, 2005 at 08:54:49PM -0700, Josh Berkus wrote: Tom, Or, as you say, we could take the viewpoint that there are commercial companies willing to take on the burden of supporting back releases, and the development community ought not spend its limited resources on doing that. I'm hesitant to push that idea very hard myself, because it would look too much like I'm pushing the interests of my employer Red Hat ... but certainly there's a reasonable case to be made there. Well, I think you know my opinion on this. Since there *are* commercial companies available, I think we should use them to reduce back-patching effort. I suggest that our policy should be: the community will patch two old releases, and beyond that if it's convenient, but no promises. In other words, when 8.1 comes out we'd be telling 7.3 users "We'll be patching this only where we can apply 7.4 patches. Otherwise, better get a support contract." I agree, although I think there should be some time guarantees as well. I like the ~3 year number that's been tossed around. Of course, a lot of this is up to individual initiative; if someone fixes a patch so it applies back to 7.2, there's no reason not to make it available. However, there's no reason *you* should make it a priority. Yeah, I hope that commercial interests can work together on supporting things they want supported. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Database file compatability
On Wed, Sep 28, 2005 at 10:22:51AM -0400, Tom Lane wrote: Qingqing Zhou [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] wrote in message There is a possible sequence like this:

    ALIGNOF_LONG 4, ALIGNOF_DOUBLE 8, MAXIMUM_ALIGNOF 8
vs.
    ALIGNOF_LONG 8, ALIGNOF_DOUBLE 8, MAXIMUM_ALIGNOF 8

So we should at least check ALIGNOF_LONG as well. No, we don't need to, because we do not really care about ALIGNOF_LONG per se. We don't use long as an on-disk datatype, precisely because we don't know what size it is. We use int32 and int64. The former has align 4 on all machines AFAIK, and the latter has MAXIMUM_ALIGNOF. Is there a serious penalty associated with just checking them all? Seems like better safe than sorry... On a related note, are checks for endianness made as well? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
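As a point of reference (a sketch, not an exhaustive list of what is checked), pg_controldata already reports the alignment a cluster was built with, and the server compares such pg_control fields against its own compiled-in values at startup:

    $ pg_controldata /path/to/data | grep -i align
    Maximum data alignment:               8

There is no dedicated endianness field, but in practice several multi-byte pg_control values (the control-file version number, for one) would fail to match on a byte-swapped architecture, so a mismatch would still be caught.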
Re: [HACKERS] PostgreSQL overall design
On Tue, Sep 27, 2005 at 07:00:14PM +0530, Gnanavel S wrote: On 9/27/05, Jonah H. Harris [EMAIL PROTECTED] wrote: Were you looking for a call graph? Yes. I want to know the list and sequence of files involved during a call. Total non-coder question, but is there an open-source utility that's capable of generating that? Seems like a useful piece of documentation to have. Also seems like it'd be completely impractical to maintain by hand. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] effective SELECT from child tables
On Wed, Sep 28, 2005 at 10:24:18PM +0200, Martijn van Oosterhout wrote: On Wed, Sep 28, 2005 at 11:00:46PM +0400, Ilia Kantor wrote: Your suggestion is essentially the same as mine.. There exists tableoid, pretty much suited to tell between tables in the case of inheritance.. I can't see a real need to add a special class classifier to each table.. This solution is a workaround. It will work, just can't make myself love it. I wonder if it would be possible to tweak the constraints exclusion code so that if it sees something of the form tableoid = X to exclude other tables... You know, assume each table has a constraint tableoid = OID. Still, it is a fairly unusual feature. Well, it's possibly a good way to do list partitioning where you can drop the partitioning key out of each partition, something I would love to have for stats.distributed.net (I'm actually working on a project that does exactly this with a UNION ALL view and rules...) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
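A minimal sketch of the idea being discussed (the table names a, a1, a2 are made up):

    CREATE TABLE a (id int, payload text);
    CREATE TABLE a1 () INHERITS (a);
    CREATE TABLE a2 () INHERITS (a);

    -- tableoid already identifies which table each row came from:
    SELECT tableoid::regclass AS source, id FROM a;

    -- Filtering on it works today, but the planner still visits every
    -- child and filters rows, rather than excluding a2's scan outright:
    SELECT * FROM a WHERE tableoid = 'a1'::regclass;

Martijn's suggestion amounts to teaching constraint exclusion to treat each child as if it carried an implicit CHECK (tableoid = its own OID), so the second query could skip the other children entirely.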
Re: [HACKERS] effective SELECT from child tables
On Wed, Sep 28, 2005 at 07:25:46PM +0100, Simon Riggs wrote: Include the Discriminator as a column in A and it will be inherited by all A1, A2, A3. e.g. concrete_class char(1) not null <snip> This will add 1 byte per row in your superclass... and requires no I thought char was actually stored variable-length...? I know there's a type that actually acts like char does on most databases, but I can't remember what it is off-hand (it should be mentioned in docs 8.3...) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
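For what it's worth, the single-byte type Jim is reaching for is "char" (with the quotes): unlike char(1), which is bpchar and carries a varlena length header, "char" really is one byte. A sketch of Simon's discriminator scheme using it (all names invented):

    CREATE TABLE a (
        concrete_class "char" NOT NULL,  -- discriminator: one byte per row
        id int
    );
    CREATE TABLE a1 (CHECK (concrete_class = '1')) INHERITS (a);
    CREATE TABLE a2 (CHECK (concrete_class = '2')) INHERITS (a);

    -- Selecting one concrete class is then an ordinary filter:
    SELECT * FROM a WHERE concrete_class = '1';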
Re: [HACKERS] [PERFORM] A Better External Sort?
25MBps should not be a CPU-bound limit for IO, nor should it be an OS limit. It should be something ~100x (single-channel RAM) to ~200x (dual-channel RAM) that. For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU is suffering some combination of A= lots of cache misses (cache thrash), B= lots of random rather than sequential IO (like pointer chasing), C= lots of wasteful copying, D= lots of wasteful calculations. In fact, this is crappy enough performance that the whole IO layer should be rethought and perhaps reimplemented from scratch. Optimization of the present code is unlikely to yield a 100-200x improvement. On the HD side, the first thing that comes to mind is that DBs are -NOT- like ordinary filesystems in a few ways: 1= the minimum HD IO is a record that is likely to be larger than a HD sector. Therefore, the FS we use should be laid out with physical segments of max(HD sector size, record size) 2= DB files (tables) are usually considerably larger than any other kind of files stored. Therefore the FS we should use should be laid out using LARGE physical pages. 64KB-256KB at a _minimum_. 3= The whole 2GB striping of files idea needs to be rethought. Our tables are significantly different in internal structure from the usual FS entity. 4= I'm sure we are paying all sorts of nasty overhead for essentially emulating the pg filesystem inside another filesystem. That means ~2x as much overhead to access a particular piece of data. The simplest solution is for us to implement a new VFS-compatible filesystem tuned to exactly our needs: pgfs. We may be able to avoid that by some amount of hacking or modifying of the current FSs we use, but I suspect it would be more work for less ROI. Ron -Original Message- From: Josh Berkus josh@agliodbs.com Sent: Sep 30, 2005 4:41 PM To: Ron Peacetree [EMAIL PROTECTED] Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] A Better External Sort? Ron, That 11MBps was your =bulk load= speed. If just loading a table is this slow, then there are issues with basic physical IO, not just IO during sort operations. Oh, yeah. Well, that's separate from sort. See multiple posts on this list from the GreenPlum team, the COPY patch for 8.1, etc. We've been concerned about I/O for a while. Realistically, you can't do better than about 25MB/s on a single-threaded I/O on current Linux machines, because your bottleneck isn't the actual disk I/O. It's CPU. Databases which go faster than this are all, to my knowledge, using multi-threaded disk I/O. (and I'd be thrilled to get a consistent 25MB/s on PostgreSQL, but that's another thread ... ) As I said, the obvious candidates are inefficient physical layout and/or flawed IO code. Yeah, that's what I thought too. But try sorting a 10GB table, and you'll see: disk I/O is practically idle, while CPU averages 90%+. We're CPU-bound, because sort is being really inefficient about something. I just don't know what yet. If we move that CPU-binding to a higher level of performance, then we can start looking at things like async I/O, O_Direct, pre-allocation etc. that will give us incremental improvements. But what we need now is a 5-10x improvement and that's somewhere in the algorithms or the code. -- --Josh Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Found small issue with OUT params
On Fri, Sep 30, 2005 at 10:53:22AM -0700, Joshua D. Drake wrote: So you might notice a little performance hit bringing back a million rows, and most of these types of single OUT param functions only return one row/value anyway. There would be zero perceivable difference in performance regardless of the extra overhead for a single value/row. Sounds like we need a test case... up for it? If there is a performance difference my vote is that we bite the bullet for 8.1 and accept the performance hit rather than settle for sub-optimal behavior. Much easier to fix the performance penalty down the road than to fix the behavior. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Query in SQL statement
On Thu, Sep 29, 2005 at 09:28:38PM +0800, Christopher Kings-Lynne wrote:

CREATE SEQUENCE ai_id;
CREATE TABLE badusers (
  id int DEFAULT nextval('ai_id') NOT NULL,
  UserName varchar(30),
  Date datetime DEFAULT '0000-00-00 00:00:00' NOT NULL,
  Reason varchar(200),
  Admin varchar(30) DEFAULT '-',
  PRIMARY KEY (id),
  KEY UserName (UserName),
  KEY Date (Date)
);

I am always getting the following errors: ERROR: relation "ai_id" already exists ERROR: syntax error at or near "(" at character 240

You have just copied the MySQL code to PostgreSQL. It will in no way work. Your default for 'Date' is illegal in PostgreSQL and hence it must allow NULLs. There is no such thing as a 'datetime' type. There is no such thing as 'KEY'. Also your mixed-case identifiers won't be preserved. You want:

CREATE TABLE badusers (
  id SERIAL PRIMARY KEY,
  UserName varchar(30),
  Date timestamp,
  Reason varchar(200),
  Admin varchar(30) DEFAULT '-'
);
CREATE INDEX UserName_Idx ON badusers(Username);
CREATE INDEX Date_Idx ON badusers(Date);

Actually, to preserve the case you can wrap everything in quotes: CREATE ... "UserName" varchar(30) ... Of course that means that now you have to do that in every statement that uses that field, too:

SELECT username FROM badusers;    -- ERROR
SELECT "UserName" FROM badusers;  -- works

I suggest ditching the CamelCase and going with underscore_separators. I'd also not use the bareword id, instead using bad_user_id. And I'd name the table bad_user. But that's just me. :) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PERFORM] A Better External Sort?
I have perused the tuple sort stuff. The good: The documentation of the sort algorithm from Knuth's TAOCP was beautifully done. Everyone who writes an algorithm should credit the original source like this, and also note where it deviates. That was done very nicely. The bad: With random access, tape-style merging is not necessary. A priority-queue-based merge will be a lot faster. The UGLY: Please, someone, tell me I was hallucinating. Is that code really READING AND WRITING THE WHOLE TUPLE with every sort exchange?! Maybe there is a layer of abstraction that I am somehow missing. I just can't imagine that is what it is really doing. If (somehow) it really is doing that, a pointer-based sort, which forms a permutation based upon the keys, would be a lot better. The fundamental algorithm itself could also be improved somewhat. Here is a {public domain} outline for an introspective quick sort that Pete Filander and I wrote some time ago and contributed to FastDB. It is written as a C++ template, but it will take no effort to make it a simple C routine. It assumes that e_type has comparison operators, so in C you would use a compare function instead.

/*
** Sorting stuff by Dann Corbit and Pete Filandr.
** ([EMAIL PROTECTED] and [EMAIL PROTECTED])
** Use it however you like.
*/

//
// The insertion sort template is used for small partitions.
//
template <class e_type>
void insertion_sort(e_type * array, size_t nmemb)
{
    e_type temp, *last, *first, *middle;
    if (nmemb > 1) {
        first = middle = 1 + array;
        last = nmemb - 1 + array;
        while (first != last) {
            ++first;
            if (*(middle) > *(first)) {
                middle = first;
            }
        }
        if (*(array) > *(middle)) {
            ((void) ((temp) = *(array), *(array) = *(middle), *(middle) = (temp)));
        }
        ++array;
        while (array != last) {
            first = array++;
            if (*(first) > *(array)) {
                middle = array;
                temp = *middle;
                do {
                    *middle-- = *first--;
                } while (*(first) > (temp));
                *middle = temp;
            }
        }
    }
}

//
// The median estimate is used to choose pivots for the quicksort algorithm
//
template <class e_type>
void median_estimate(e_type * array, size_t n)
{
    e_type temp;
    long unsigned lu_seed = 123456789LU;
    const size_t k = ((lu_seed) = 69069 * (lu_seed) + 362437) % --n;
    ((void) ((temp) = *(array), *(array) = *(array + k), *(array + k) = (temp)));
    if (*(array + 1) > *(array)) {
        (temp) = *(array + 1);
        if (*(array + n) > *(array)) {
            *(array + 1) = *(array);
            if ((temp) > *(array + n)) {
                *(array) = *(array + n);
                *(array + n) = (temp);
            } else {
                *(array) = (temp);
            }
        } else {
            *(array + 1) = *(array + n);
            *(array + n) = (temp);
        }
    } else {
        if (*(array) > *(array + n)) {
            if (*(array + 1) > *(array + n)) {
                (temp) = *(array + 1);
                *(array + 1) = *(array + n);
                *(array + n) = *(array);
                *(array) = (temp);
            } else {
                ((void) ((temp) = *(array), *(array) = *(array + n), *(array + n) = (temp)));
            }
        }
    }
}

//
// This is the heart of the quick sort algorithm used here.
// If the sort is going quadratic, we switch to heap sort.
// If the partition is small, we switch to insertion sort.
//
template <class e_type>
void qloop(e_type * array, size_t nmemb, size_t d)
{
    e_type temp, *first, *last;
    while (nmemb > 50) {
        if (sorted(array, nmemb)) {
            return;
        }
        if (!d--) {
            heapsort(array, nmemb);
            return;
        }
        median_estimate(array, nmemb);
        first = 1 + array;
        last = nmemb - 1 + array;
        do {
            ++first;
        } while (*(array) > *(first));
        do {
            --last;
        } while (*(last) > *(array));
        while (last > first) {
            ((void) ((temp) = *(last), *(last) = *(first), *(first) = (temp)));
            do {
                ++first;
            } while (*(array) > *(first));
            do {
                --last;
            } while (*(last) > *(array));
        }
        ((void) ((temp) = *(array), *(array) = *(last), *(last) = (temp)));
        qloop(last + 1, nmemb - 1 + array - last, d);
        nmemb = last - array;
    }
    insertion_sort(array, nmemb);
}

//
// This heap sort is better than average because it uses Lamont's heap.
//
template <class e_type>
void heapsort(e_type * array, size_t nmemb)
{
    size_t i, child, parent;
    e_type temp;
    if (nmemb > 1) {
        i = --nmemb /
Re: [HACKERS] PCTFree Results
On Thu, Sep 29, 2005 at 11:05:47AM -0400, Jonah H. Harris wrote: Has there been any movement on this? If not, I finally have some time to look at it. Well, here's the original thread: http://archives.postgresql.org/pgsql-hackers/2005-08/msg00637.php I think the problem is that the testing isn't going to show us much (if anything) based on applying PCTFREE across all tables. It only makes sense for tables that see a lot of updates, or Index Organized Tables (to use Oracle vernacular) with indexes spread throughout the table (I don't think there's any other way to get indexes forcibly spread throughout a table, and in any case it doesn't matter for PostgreSQL right now). So, the NOTPM numbers are probably pretty meaningless. One of the other metrics that's essentially just updating might be more interesting. Josh mentioned that there were additional stats produced by the patch, but the URL seems to be down so I can't go and check. :( It would be really useful if someone with a real-life heavy-update workload stepped up. I'm going to cc Satoshi; maybe he's not on the list. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Open items list for 8.1
On Fri, Sep 30, 2005 at 06:58:05PM -0400, Bruce Momjian wrote: We don't have the ability to have two functions that take the same parameters and return different results, because there is no facility to decide which function to call based on what return value is expected; a simple query doesn't have a return value restriction: SELECT pg_cancel_backend(); Duh, I keep forgetting that. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] \d on database with a lot of tables is slow
On Fri, 30 Sep 2005, Jim C. Nasby wrote: ERROR: cache lookup failed for relation 1906465919 It is on an exact table name. When we retry the describe on a failure, sometimes it works and sometimes it fails again. When it fails again the relation number is different. Also, \d schema.table always returns quickly and never errors. \d table is slow and produces the error fairly often. They're using 8.0.x; I'm pretty certain it's 8.0.3. We have the exact same problem on a 7.4.5 database. Some basic info on the database: psql's \d returns 424 rows, and the on-disk size of the database is about 11 GB. A standalone \dt throws the same occasional error. I've suspected that it may be caused by ongoing periodic (at least once an hour) rebuilding of certain generated tables with TRUNCATE and then INSERT INTO the_table SELECT ... inside a transaction. But I don't have any proof of that; it's just the most obvious different thing going on compared to other databases we have. It does seem like I've encountered the error less often since increasing the max_fsm_pages setting and thus had more effective VACUUM and less pg_class bloat, but OTOH I trained myself not to do \d there very often either, since it was so slow and failed so often, so that may be coincidence. :) Jon -- Jon Jensen End Point Corporation http://www.endpoint.com/ Software development with Interchange, Perl, PostgreSQL, Apache, Linux, ... ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
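For anyone chasing this, the failure mode is consistent with a race against a DROP/recreate: \d first resolves the bare name to an OID, then issues follow-up catalog queries by that OID. Running psql with -E echoes the queries it uses; the first one looks roughly like this (exact text varies by version):

    SELECT c.oid, n.nspname, c.relname
    FROM pg_catalog.pg_class c
         LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relname ~ '^(foo)$'
      AND pg_catalog.pg_table_is_visible(c.oid)
    ORDER BY 2, 3;

If the table is dropped and recreated between that lookup and the queries that follow it, the saved OID is stale and the backend reports "cache lookup failed for relation" with the stale number - which would fit Jon's hourly rebuild jobs only if they actually drop and recreate the tables rather than just TRUNCATE them, since TRUNCATE keeps the table's OID.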
Re: [HACKERS] Found small issue with OUT params
Jim C. Nasby wrote: On Fri, Sep 30, 2005 at 10:53:22AM -0700, Joshua D. Drake wrote: So you might notice a little performance hit bringing back a million rows, and most of these types of single OUT param functions only return one row/value anyway. There would be zero perceivable difference in performance regardless of the extra overhead for a single value/row. Sounds like we need a test case... up for it? If there is a performance difference my vote is that we bite the bullet for 8.1 and accept the performance hit rather than settle for sub-optimal behavior. Much easier to fix the performance penalty down the road than to fix the behavior. Yes, it seems this is the consensus. I have added it to the open items list: have a single OUT parameter return the parameter name, not function name ---

PostgreSQL 8.1 Open Items
=========================

Current version at http://candle.pha.pa.us/cgi-bin/pgopenitems or from http://www.postgresql.org/developer/beta.

Bugs
----
fix pg_dump --clean for roles
fix foreign trigger timing issue
fix ALTER SCHEMA RENAME for sequence name binding, or remove
improve spinlock performance
fix semantic issues of granted permissions in roles
fix pgxs for spaces in file names
have a single OUT parameter return the parameter name, not function name

Questions
---------
consider O_SYNC as default when O_DIRECT exists
/contrib move to pgfoundry
pgindent?
make sure bitmap scan optimizer settings are reasonable

Documentation
-------------

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
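The behavior the open item targets, sketched (function and parameter names invented):

    CREATE FUNCTION get_answer(OUT answer int) AS 'SELECT 42' LANGUAGE sql;

    SELECT * FROM get_answer();
    --  get_answer
    -- ------------
    --          42

Today the lone result column is labeled with the function name, get_answer; the open item is for it to be labeled answer instead, matching what functions with multiple OUT parameters already do with their parameter names.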
Re: [HACKERS] Found small issue with OUT params
On 9/30/05, Jim C. Nasby [EMAIL PROTECTED] wrote: On Fri, Sep 30, 2005 at 10:53:22AM -0700, Joshua D. Drake wrote: So you might notice little performance hit bringing back a million rows, and most of these type of single OUT params functions only return one row/value anyway. There would be zero perceivable difference in performance regardless of the extra overhead for a single value/row. Sounds like we need a test case... up for it? If there is a performance difference my vote is that we bite the bullet for 8.1 and accept the performance hit rather than settle for sub-optimal behavior. Much easier to fix the performance penalty down the road than to fix the behavior. What about just returning the single OUT value named by the parameter, instead of special casing single-OUT functions? If I understand correctly, Tom has just added a test to make single-OUT functions look like RETURNS functions. If that were removed then we'd have what, at least by counting the responses on this thread, seems to be the desired (and expected) behaviour. Or I could just be misunderstanding the implementation again. -- Mike Rylander [EMAIL PROTECTED] GPLS -- PINES Development Database Developer http://open-ils.org ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] On Logging
On Fri, Sep 30, 2005 at 05:54:49PM -0500, Jim C. Nasby wrote: On Mon, Sep 26, 2005 at 10:57:54AM -0700, Ron Mayer wrote: David Fetter wrote: ...log file formats in 8.0:
* CSV
* YAML
* XML
* Piped logs, as Apache can do
* DB handle. I know this one will be controversial. [...]
1. Am I the only one who wants an option for machine-readable logs? I'd very much like a format that can be easily loaded into a database (not necessarily the same one producing the logs :-) ) in real time and/or be visible as a table through something like dbi-link. I suppose any of the first three formats you suggest could work with dbi-link; another alternate format - SQL INSERT statements - would work, if piped logs were supported, by sending them to psql. Apache seems to have the best, most flexible logging of anything out there, and should probably be used as a model. It's pretty easy to have it actually log to a database. Great :) Whatever method we decide on, I think it would be very useful if we supported multiple logging streams. I certainly wouldn't want to give up a human-readable log to get a CSV one. Excellent idea. Is a logging mechanism the best way to do profiling? Seems like it might be better to have a more efficient, dedicated method. I'm not totally confident in my ability to think up everything I'd want to look at, and every way I'd want to look at it, in advance. I'm pretty sure I can't come up with every way everyone could want to do profiling, though. But I'm not against adding capabilities like per-backend logging, etc. How would per-backend logging work? Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 510 893 6100 mobile: +1 415 235 3778 Remember to vote! ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
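A minimal sketch of the "load the logs into a database" idea, assuming a CSV log format existed; the table layout here is entirely hypothetical:

    CREATE TABLE server_log (
        log_time  timestamptz,
        username  text,
        datname   text,
        pid       integer,
        message   text
    );

    -- COPY ... WITH CSV (available since 8.0) would ingest such a file:
    COPY server_log FROM '/var/log/pgsql/postgresql.csv' WITH CSV;

Loading into a different database than the one producing the logs, as Ron suggests, also sidesteps the obvious chicken-and-egg problem of logging your own log inserts.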
Re: [HACKERS] [PERFORM] A Better External Sort?
On 9/30/05, Ron Peacetree [EMAIL PROTECTED] wrote: 4= I'm sure we are paying all sorts of nasty overhead for essentially emulating the pg filesystem inside another filesystem. That means ~2x as much overhead to access a particular piece of data. The simplest solution is for us to implement a new VFS compatible filesystem tuned to exactly our needs: pgfs. We may be able to avoid that by some amount of hacking or modifying of the current FSs we use, but I suspect it would be more work for less ROI. On this point, Reiser4 fs already implements a number of things which would be desirable for PostgreSQL. For example: write()s to reiser4 filesystems are atomic, so there is no risk of torn pages (this is enabled because reiser4 uses WAFL-like logging where data is not overwritten but rather relocated). The filesystem is modular and extensible so it should be easy to add whatever additional semantics are needed. I would imagine that all that would be needed is some more atomicity operations (single writes are already atomic, but I'm sure it would be useful to batch many writes into a transaction), some layout and packing controls, and some flush controls. A step further would perhaps integrate multiversioning directly into the FS (the wandering logging system provides the write side of multiversioning; a little read-side work would be required). More importantly: the file system was intended to be extensible for this sort of application. It might make a good 'summer of code' project for someone next year ... presumably reiser4 will have made it into the mainline kernel by then. :) ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PERFORM] A Better External Sort?
On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote: 2= We use my method to sort two different tables. We now have these very efficient representations of a specific ordering on these tables. A join operation can now be done using these Btrees rather than the original data tables that involves less overhead than many current methods. If we want to make joins very fast we should implement them using RD trees. For the example cases where a join against a very large table will produce a much smaller output, a RD tree will provide pretty much the optimal behavior at a very low memory cost. On the subject of high speed tree code for in-core applications, you should check out http://judy.sourceforge.net/ . The performance (insert, remove, lookup, AND storage) is really quite impressive. Producing cache friendly code is harder than one might expect, and it appears the judy library has already done a lot of the hard work. Though it is *L*GPLed, so perhaps that might scare some here away from it. :) and good luck directly doing joins with a LC-TRIE. ;) ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] Expression index ignores column statistics target
I've noticed that row count estimates for expression indexes appear to rely on default_statistics_target rather than on a column's actual statistics target. That is, if I use ALTER TABLE SET STATISTICS to increase a column's statistics target and then run ANALYZE, then estimates for non-expression-index queries improve as expected. However, queries that use an expression index remain accurate for only around the N most common values, where N is the default_statistics_target that was in effect when ANALYZE ran. I'm still rummaging through the archives looking for past discussion; is this behavior a known limitation or just an oversight?

CREATE TABLE foo (x integer);
CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));
INSERT INTO foo (x) SELECT r1 % r2 FROM generate_series(1, 100) AS g1(r1), generate_series(1, 100) AS g2(r2);
SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;

SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
                  most_common_vals
------------------------------------------------------
 {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
   Recheck Cond: (x = 13)
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
         Index Cond: (x = 13)
 Total runtime: 2.905 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
   Recheck Cond: (abs(x) = 13)
   ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1)
         Index Cond: (abs(x) = 13)
 Total runtime: 2.875 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
   Recheck Cond: (x = 18)
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
         Index Cond: (x = 18)
 Total runtime: 2.393 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
   Recheck Cond: (abs(x) = 18)
   ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1)
         Index Cond: (abs(x) = 18)
 Total runtime: 2.418 ms
(5 rows)

-- Michael Fuhr ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Expression index ignores column statistics target
This is expected. The main TODO item is:

* Allow accurate statistics to be collected on indexes with more than one column or expression indexes, perhaps using per-index statistics

Basically, we don't have multi-column or expression statistics. ANALYZE just analyzes columns, even if an expression index exists.

--- Michael Fuhr wrote: I've noticed that row count estimates for expression indexes appear to rely on default_statistics_target rather than on a column's actual statistics target. That is, if I use ALTER TABLE SET STATISTICS to increase a column's statistics target and then run ANALYZE, then estimates for non-expression-index queries improve as expected. However, queries that use an expression index remain accurate for only around the N most common values, where N is the default_statistics_target that was in effect when ANALYZE ran. I'm still rummaging through the archives looking for past discussion; is this behavior a known limitation or just an oversight?

CREATE TABLE foo (x integer);
CREATE INDEX foo_x_idx ON foo (x);
CREATE INDEX foo_abs_x_idx ON foo (abs(x));
INSERT INTO foo (x) SELECT r1 % r2 FROM generate_series(1, 100) AS g1(r1), generate_series(1, 100) AS g2(r2);
SET default_statistics_target TO 15;
ALTER TABLE foo ALTER COLUMN x SET STATISTICS 20;
ANALYZE foo;

SELECT most_common_vals FROM pg_stats WHERE attname = 'x';
                  most_common_vals
------------------------------------------------------
 {0,1,2,3,4,5,6,7,8,10,9,11,12,14,13,15,16,19,17,18}
(1 row)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 13;
 Bitmap Heap Scan on foo  (cost=2.72..50.28 rows=205 width=4) (actual time=0.370..1.766 rows=220 loops=1)
   Recheck Cond: (x = 13)
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.314..0.314 rows=220 loops=1)
         Index Cond: (x = 13)
 Total runtime: 2.905 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 13;
 Bitmap Heap Scan on foo  (cost=2.72..50.80 rows=205 width=4) (actual time=0.358..1.720 rows=220 loops=1)
   Recheck Cond: (abs(x) = 13)
   ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.72 rows=205 width=0) (actual time=0.305..0.305 rows=220 loops=1)
         Index Cond: (abs(x) = 13)
 Total runtime: 2.875 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE x = 18;
 Bitmap Heap Scan on foo  (cost=2.60..49.75 rows=172 width=4) (actual time=0.312..1.442 rows=180 loops=1)
   Recheck Cond: (x = 18)
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..2.60 rows=172 width=0) (actual time=0.262..0.262 rows=180 loops=1)
         Index Cond: (x = 18)
 Total runtime: 2.393 ms
(5 rows)

EXPLAIN ANALYZE SELECT * FROM foo WHERE abs(x) = 18;
 Bitmap Heap Scan on foo  (cost=2.22..43.65 rows=63 width=4) (actual time=0.313..1.436 rows=180 loops=1)
   Recheck Cond: (abs(x) = 18)
   ->  Bitmap Index Scan on foo_abs_x_idx  (cost=0.00..2.22 rows=63 width=0) (actual time=0.263..0.263 rows=180 loops=1)
         Index Cond: (abs(x) = 18)
 Total runtime: 2.418 ms
(5 rows)

-- Michael Fuhr ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Expression index ignores column statistics target
Michael Fuhr [EMAIL PROTECTED] writes: I've noticed that row count estimates for expression indexes appear to rely on default_statistics_target rather than on a column's actual statistics target. That is, if I use ALTER TABLE SET STATISTICS to increase a column's statistics target and then run ANALYZE, then estimates for non-expression-index queries improve as expected. However, queries that use an expression index remain accurate for only around the N most common values, where N is the default_statistics_target that was in effect when ANALYZE ran. The code does in fact honor per-column statistics targets attached to expression indexes, viz alter table myfuncindex alter column pg_expression_1 set statistics 100; This isn't documented, mainly because pg_dump doesn't promise to dump such things, which it doesn't do because I didn't want to see the pg_expression_N naming for expression index columns become graven on stone tablets. I seem to recall bringing up the question of whether we could find a less implementation-specific way of commanding this behavior, but I can't find it in the archives right now. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
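Applied to Michael's test case upthread, the (undocumented) knob Tom describes would look like this sketch:

    ALTER TABLE foo_abs_x_idx ALTER COLUMN pg_expression_1 SET STATISTICS 20;
    ANALYZE foo;

after which the abs(x) = 18 estimate should be driven by the index's own 20-entry statistics rather than by whatever default_statistics_target was at ANALYZE time. As Tom notes, pg_dump does not promise to preserve such settings, so treat this as a per-installation tweak.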
Re: [HACKERS] Expression index ignores column statistics target
On Fri, Sep 30, 2005 at 11:59:26PM -0400, Bruce Momjian wrote: This is expected. The main TODO item is: * Allow accurate statistics to be collected on indexes with more than one column or expression indexes, perhaps using per-index statistics Basically, we don't have multi-column or expression statistics. ANALYZE just analyzes columns, even if an expression index exists. But the row count estimates imply that expression index queries do use column statistics, presumably as a proxy in the absence of expression statistics. This looks like a relevant commit: http://archives.postgresql.org/pgsql-committers/2004-02/msg00124.php The behavior I observed is that the planner does appear to use column statistics when planning an expression index query, but it doesn't appear to honor a column's non-default statistics target. In other words:

* Row count estimates for expression index queries (at least simple ones) are reasonably accurate for the N most common column values, where N is the value of default_statistics_target when ANALYZE was run.

* Specifically setting the column's statistics target with ALTER TABLE SET STATISTICS doesn't result in better statistics for expression index queries.

That difference in behavior seems odd: if default_statistics_target has an effect, why doesn't ALTER TABLE SET STATISTICS? -- Michael Fuhr ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Expression index ignores column statistics target
On Sat, Oct 01, 2005 at 12:53:03AM -0400, Tom Lane wrote: The code does in fact honor per-column statistics targets attached to expression indexes, viz alter table myfuncindex alter column pg_expression_1 set statistics 100; Aha -- that's the piece I didn't know about. I was wondering where those statistics were being stored, since they were affected by default_statistics_target but not by per-column statistics targets. And now I see them when I don't restrict queries against pg_stats by just the table or column name. Thanks. -- Michael Fuhr ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [PERFORM] A Better External Sort?
Judy definitely rates a WOW!! -Original Message- From: [EMAIL PROTECTED] [mailto:pgsql-hackers- [EMAIL PROTECTED] On Behalf Of Gregory Maxwell Sent: Friday, September 30, 2005 7:07 PM To: Ron Peacetree Cc: Jeffrey W. Baker; pgsql-hackers@postgresql.org; pgsql- [EMAIL PROTECTED] Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On 9/28/05, Ron Peacetree [EMAIL PROTECTED] wrote: 2= We use my method to sort two different tables. We now have these very efficient representations of a specific ordering on these tables. A join operation can now be done using these Btrees rather than the original data tables that involves less overhead than many current methods. If we want to make joins very fast we should implement them using RD trees. For the example cases where a join against a very large table will produce a much smaller output, a RD tree will provide pretty much the optimal behavior at a very low memory cost. On the subject of high speed tree code for in-core applications, you should check out http://judy.sourceforge.net/ . The performance (insert, remove, lookup, AND storage) is really quite impressive. Producing cache friendly code is harder than one might expect, and it appears the judy library has already done a lot of the hard work. Though it is *L*GPLed, so perhaps that might scare some here away from it. :) and good luck directly doing joins with a LC-TRIE. ;) ---(end of broadcast)--- TIP 6: explain analyze is your friend ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings