Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
>> Can you CLUSTER against a minmax index?
>
> Not in this release, at least in my understanding. It's not yet
> possible to do an ordered fetch, so the cluster scan probably won't
> work.

As per the patch I helped write, CLUSTER should use the sequential heap scan + sort whenever "it makes sense". So I think that if the index is not able to do an ordered fetch, CLUSTER should fall back to scan + sort automatically (which is what you want on a large table anyway). Obviously, that should be tested.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> From our discussions here, IMHO there is a strong case for avoiding
> btrees completely for larger historical data tables. That isn't
> something I had even considered as desirable before this conversation
> but ISTM now that taking that approach will be more fruitful than
> attempting to implement LSM trees.

Eh? I don't understand this point. How can I avoid btrees, yet still search by caller_id? I don't get it...

Simon Riggs wrote
> Alvaro has given me some results for his patch. The figures I have are
> for a 2GB table.
>
>                    MinMax                         Btree
> Index build time   11 s                           96 s
> Index size         2 pages + metapage             approx 200,000 pages + metapage
> Load time          no overhead, same as raw COPY  considerably slower

Great!!! This looks very promising. Were the indexed values sequential?
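A rough way to reproduce this kind of comparison today. Minmax indexes were only a patch at the time; BRIN (PostgreSQL 9.5+) is their descendant, so it is used here as a stand-in, and the table and column names are invented:

    -- Hypothetical test table; adjust the row count to taste.
    CREATE TABLE cdr_test AS
      SELECT g AS id,
             now() + (g || ' seconds')::interval AS call_timestamp,
             trunc(random() * 1000000)::bigint   AS caller_id
      FROM generate_series(1, 10000000) g;

    \timing on
    -- BRIN stands in for the minmax patch being measured above.
    CREATE INDEX cdr_test_brin  ON cdr_test USING brin  (call_timestamp);
    CREATE INDEX cdr_test_btree ON cdr_test USING btree (call_timestamp);

    -- Compare on-disk sizes:
    SELECT pg_size_pretty(pg_relation_size('cdr_test_brin'))  AS brin_size,
           pg_size_pretty(pg_relation_size('cdr_test_btree')) AS btree_size;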
Re: [HACKERS] Fast insertion indexes: why no developments
Jeremy Harris wrote
> Surely there's good correlation between IMSI & IMEI, so have a separate
> table to translate one to (a group of) the others, and
> halve the indexes on your main table?

Yes; unfortunately both are not always available, but it's something we are thinking about (it requires logic in the "inserting application" that doesn't exist at the moment, but it is something we'll have to add sooner or later). In the end, yes, trying to use fewer indexed fields is a good path.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> So in the use case you describe, the min max index would require a
> scan of only 25% of the table, not the 80% described earlier for
> random inserts. In my experience, people wish to keep data for much
> longer periods and so the percentage of scan required would drop lower
> than 25%, possibly to 5% or less for many applications.
>
> The plan would use sequential I/O so could still work quickly; given
> the low read rate, longer query times could be acceptable.

"Quickly"??? Seq-scanning 25% of a multi-TB table is not "quick".

Simon Riggs wrote
> Minmax indexes are simply a way to make this use case happen
> automatically, without the need for manual partitioning of the table.

Your logic assumes that we index nothing but call_timestamp. That would lead to huge query response times. Plus: btree has no trouble keeping up with sequential insertion (such as call_timestamp). So I still don't see the gain in using minmax indexes; again, could you point me to performance tests of *any* use case?

Simon Riggs wrote
> They are not the answer to every prayer, but with respect they are
> better than you had claimed they would be. (25% not 80%, in your use
> case). I saw this was likely to be the case and this is why I
> challenged you to describe in more detail. Thank you.

I claimed they would scan 80% of the table because I assumed I had to use them on the random fields, not on call_timestamp. I don't need a better index on call_timestamp: it's sequential, so I have no problem with it. But 25% is still useless: I don't want to seq-scan 25% of a multi-TB table.

Simon Riggs wrote
> Performance tests are only possible with a clear use case.

Well, so I could add my weird_index patch to the postgresql source code and it would be committed right away??? I assumed you had to prove somehow that a new index is better than what is already available, at least for some cases. Or, in other words: what are you going to write in the minmax index documentation, "try it and see if it works better for you"?

Simon Riggs wrote
> Please see that Alvaro and I have gone out of our way to provide a new
> facility to help you and others, and that it requires changing how we
> think about the solution. I accept it may not provide for every case
> but it requires careful analysis before deciding that is so.

If I came across too rough, I ask your pardon. I always appreciate people taking their time to help someone else for free. Plus, I'm *very* interested in minmax indexes, especially for call_timestamp (some queries are date-range only, such as "give me all the calls in this particular 2-second range") or for the id column I have. But, at the same time, I don't see any evidence that they work better than btrees (except for the size of the index). I would like to see some numbers.

I worked a little on the bitmap index implementation, and I stopped because, on the large tables these indexes are supposed to be used on, the heap lookups took so much time that the (slightly) faster index access didn't really help: it was a fraction of the query time... I'm afraid it would be the same with minmax indexes; that's why I wanted to see some numbers.

Simon Riggs wrote
>> And, again, I think that random values insertion is the worst
>> use case for minmax indexes.
>
> I agree, it is.
> What we have disagreed on is the extent to which that
> scenario exists for use cases on very large tables, which are
> typically "append-mostly" with most queries accessing a subset of the
> table, e.g. date range.

Mhh... maybe this is the point where we don't understand each other. I query the table by userid + date range. The date range is *not* selective enough (it covers, as you said, 25% of the multi-TB table). The userid is selective *a lot*. I'm looking for a "better" index for the userid column(s). The "new" indexes I mentioned in the OP claim to be better in this scenario (but I don't blindly believe them).

I don't see how indexing call_timestamp alone could be of any interest, since it would require seq-scanning 25% of a huge table for every query.
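To make the disagreement concrete, the query shape is roughly this (table and column names are hypothetical):

    -- A btree on the user column narrows this to a handful of rows;
    -- a minmax index on call_timestamp alone could only narrow it
    -- to roughly 25% of the table.
    SELECT *
    FROM   cdr
    WHERE  calling_imsi = 222015551234567                -- very selective
    AND    call_timestamp >= now() - interval '4 days'   -- not selective
    ORDER BY call_timestamp DESC;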
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> On 5 November 2013 14:28, Leonardo Francalanci <m_lists@> wrote:
>> Either my sql is not correct (likely), or my understanding of the minmax
>> index is not correct (even more likely), or the minmax index is not
>> usable in a random-inputs scenario.
>
> Please show the real world SQL you intend to run, so we can comment on
> it. Inventing a use case that breaks effectiveness of any optimisation
> is always easy, but the question is whether the use case is likely or
> even desirable.

The use case is pretty simple. Think of it as the NSA, as that makes it easier to picture.

Insert part: collect all calls made/received by any user on a mobile network (in fact, it's something more than calls, so it is really 2-10 rows per call). Keep the data for 20 days.

Query part: search calls made/received by a user, looked up by IMSI (caller id) or IMEI (phone id). The date range is usually days (the past 4 days; from 10 days ago to 5 days ago...). The result is just a very small percentage of the rows present in the table: a single user doesn't call that much! Searches are made by a human, so not that many requests per second. It's not merely a "write-mostly" scenario; it's a 99%-write, 1%-read scenario.

Problem? Having 4 btree indexes on random values (imsi + imei * 2, since we have calling and called) kills insertion performance after a while.

Solution so far? Partition every 15 minutes, and create the indexes in bulk afterwards.

Simon Riggs wrote
> If we have a query to show the most recent calls by a particular caller
>
> SELECT *
> FROM cdr
> WHERE callerid = X
> ORDER BY call_timestamp DESC
> LIMIT 100
>
> then this could potentially be optimised using a minmax index, by
> traversing the data ranges in call_timestamp order. That is not part
> of the code in this initial release, since the main use case is for
> WHERE call_timestamp >= X, or WHERE primarykey = Y

I don't understand how an index on call_timestamp would help with the query above.

Simon Riggs wrote
> I don't believe there is a credible business case for running that
> same query but without the ORDER BY and LIMIT, since it could
> potentially return gazillions of rows

Gazillions of rows??? We're talking about the calls made/received by one user here. How many calls do you make in 10 days???

Simon Riggs wrote
> so it isn't surprising at all
> that it would access a large % of the table.

In fact, the query I use returns a fraction of the table, and only a very small percentage of users ever gets searched.

Simon, you keep talking about these minmax indexes, and I still don't see any reference to performance tests. And, again, I think that random-values insertion is the worst use case for minmax indexes.
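A sketch of the table shape being described; names and types are guesses, not the real schema:

    -- Four btree indexes on effectively random values are what kills
    -- insertion performance after the first few million rows.
    CREATE TABLE cdr (
        id             bigserial,
        call_timestamp timestamptz NOT NULL,
        calling_imsi   bigint,
        called_imsi    bigint,
        calling_imei   bigint,
        called_imei    bigint
        -- ... plus the rest of the call record
    );
    CREATE INDEX ON cdr (call_timestamp);  -- sequential: cheap to maintain
    CREATE INDEX ON cdr (calling_imsi);    -- random: expensive
    CREATE INDEX ON cdr (called_imsi);     -- random: expensive
    CREATE INDEX ON cdr (calling_imei);    -- random: expensive
    CREATE INDEX ON cdr (called_imei);     -- random: expensive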
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> Some experiments I did a few years ago showed that applying sorts to the
> data to be inserted could be helpful even when the sort batch size was as
> small as one tuple per 5 pages of existing index. Maybe even less.

Cool!!! Do you have any idea/hint on how I could try to replicate that? Do you remember how you did it?
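A minimal way to try the idea, assuming rows first land in an unindexed staging table (hypothetical names). Inserting a batch in index-key order means each btree leaf page is touched once per batch instead of many times:

    BEGIN;
    INSERT INTO cdr
      SELECT * FROM cdr_staging
      ORDER BY calling_imsi;     -- presort on one index's key
    DELETE FROM cdr_staging;
    COMMIT;

    -- Note: with several random-key indexes you can only presort for
    -- one of them per pass; the others still see random insertions.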
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> Well, of course, they're not magic pixie dust.

Of course they aren't. I think they can make a difference in a sequential-input scenario. But I'm not the one who said they are fit to solve the problems that I and other people are describing in this thread.

Claudio Freire wrote
> But is your data really random? (or normal?)
> That's the thing...

I still don't understand. Even if the data were normal, it wouldn't work. You can change the 3rd parameter of normal_rand and get a narrower distribution, but then the query values would be narrower too... so you'd get the same results. The problem is: if the range between min and max in each page range is too wide, you end up scanning a lot of heap pages.

To me, "the thing" is: has anyone run performance tests of these minmax indexes with input that is not sequential? (BTW, I'd like to see tests for the sequential-input scenario too...)
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> real data isn't truly random

Well, let's try normal_rand then:

    create table t1 as
      select trunc(normal_rand(1000000, 50, 3)) as n,
             generate_series(1, 1000000) as i;

    with cte as (select min(n) as minn, max(n) as maxn, i/100
                 from t1 group by i/100),
         inp as (select generate_series(1, 100) as iinp,
                        trunc(normal_rand(100, 50, 3)) as s)
    select count(*), iinp
    from inp left outer join cte on inp.s between minn and maxn
    group by iinp;

Not that much different in my run...

Claudio Freire wrote
> you haven't really
> analyzed update cost, which is what we were talking about in that last
> post.

I don't care about a better update cost if the cost to query is a table scan. Otherwise, I could just claim that no index at all is even better than minmax: zero update cost, pretty much the same query time. Maybe there's value in minmax indexes for sequential data, but not for random data, which is the topic of this thread.

BTW, I would like to see some performance tests of minmax indexes vs btree for sequential input too... is the gain worth it? I couldn't find any mention of performance tests in the minmax threads.
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> Min-max indexes always require a sequential scan of the min-max index
> itself when querying.

I'm worried about the number of heap pages that will be scanned. My understanding is that, given random input, the index will not be selective enough and will end up requiring a lot of page scanning to get the relevant rows. That is: the "selectivity" of the min/max values will be very low, given the high cardinality and randomness of the input. I'm afraid that in most page ranges the min will be lower than the queried value and the max higher, resulting in lots of heap page scans.

Quick test: a table with 1M rows, with random values from 1 to 1000:

    create table t as
      select generate_series(1, 1000000) as i,
             trunc(random() * 1000) as n;

Suppose a page range contains 100 rows (my understanding is that the minmax index will actually use a much bigger row count...); let's find how many "page ranges" should be scanned to find a series of 50 values (again, random in the 1-1000 range):

    with cte as (select min(n) as minn, max(n) as maxn, i/100
                 from t group by i/100),
         inp as (select generate_series(1, 50) as iinp,
                        trunc(random() * 1000) as s)
    select count(*)
    from inp left outer join cte on inp.s between minn and maxn
    group by iinp;

I get > 9000 page ranges for 49 values out of 50... which means scanning 90% of the table. So either my sql is not correct (likely), or my understanding of the minmax index is not correct (even more likely), or the minmax index is not usable in a random-inputs scenario.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> On 5 November 2013 09:57, Leonardo Francalanci <m_lists@> wrote:
>> While I do believe in testing (since "In theory there is no difference
>> between theory and practice. In practice there is"), I would like to know
>> the "properties" of the minmax index before trying it.
>> What is it supposed to be good at? What are the pros/cons? We can't ask
>> all the users to just "try" the index and see if it works for them.
>
> No, but then all users aren't suggesting we need a new index type are
> they?
>
> I think its reasonable for you to spend time checking whether what you
> require will be met, and if not, what precise query doesn't it help,
> so we can better design any future new-index.

I don't understand the parallel between "We can't ask all the users to just try the index and see if it works for them" and "all users aren't suggesting we need a new index type". Anyway: I'm not suggesting we need a new index type. Please read my first post: I'm asking for info, fearing that there's just a lot of marketing/hype around those indexes. What do you mean by "spend time checking whether what you require will be met"? Met by what, minmax indexes?

Simon Riggs wrote
>> As I said, my understanding is that it is very efficient (both in
>> insertion and in searching) when data is somehow ordered in the table.
>> But maybe I got it wrong...
>>
>> Anyway, the sql scenario is simple: a table with 4 bigint indexes; data
>> in the fields is a random bigint in the range 1-1000. Insertion is 5-10K
>> rows/second. One search every 1-5 seconds, by one of the indexed fields
>> (only equality, no range). There's also an index on a timestamp field,
>> but that's not random so it doesn't give that many problems (it's
>> actually the one where I wanted to try the minmax).
>
> Without meaning to pick on you, imprecise analysis yields unhelpful
> new features. The clearer we are about what we are trying to solve the
> more likely we are to solve it well. 30 seconds analysis on what is
> needed is not sufficient to justify an additional man year of
> development, especially if a man year of work has already been done
> and the testing of the latest feature is now at hand.

I never said that my analysis justifies a man-year of work. As I said, I'm actually not confident at all that those "cool" indexes would work in my scenario even if we had them (marketing aside, there is not much data/tests out there on those indexes). I just wanted to know whether the matter had been discussed in the past, and to get more info. At the same time, I'm reluctant to try a new index hoping it will work in my case just because it's new and a man-year of work has already been done.

Again: what is this minmax index supposed to be good at? If it's indexing data in mostly-sequential order, it won't help me. From what I understood (maybe I got it wrong?), the index stores min/max values for sequences of pages. In my case I guess those min/max values would be close to 1 (min) / 1000 (max), because I insert data in random order. So any query would scan the entire table anyway. Am I wrong?

Simon Riggs wrote
> The requests from the indexes field, are they ordered?

Mmmh, I don't think I understand the question... an operator searches for the calls made by a user, so he searches in random order...

Simon Riggs wrote
> are they limited? Or you really want ALL calls? What is the tolerance
> on those?

I want all the calls made by the user. But there won't be many calls for one user. And most users will never be queried (as I said, one "calls by person" search every 1-5 seconds, so only a very small percentage of calls will ever be queried/retrieved).
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> Minmax indexes seem to surprise many people, so broad generalisations
> aren't likely to be useful.
>
> I think the best thing to do is to publish some SQL requests that
> demonstrate in detail what you are trying to achieve and test them
> against minmax indexes. That way we can discuss what does work and
> what doesn't work well enough yet.

While I do believe in testing (since "In theory there is no difference between theory and practice. In practice there is"), I would like to know the "properties" of the minmax index before trying it. What is it supposed to be good at? What are the pros/cons? We can't ask all users to just "try" the index and see if it works for them.

As I said, my understanding is that it is very efficient (both in insertion and in searching) when data is somehow ordered in the table. But maybe I got it wrong...

Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in the fields is a random bigint in the range 1-1000. Insertion is 5-10K rows/second. One search every 1-5 seconds, by one of the indexed fields (only equality, no range). There's also an index on a timestamp field, but that one isn't random so it doesn't give that many problems (it's actually the one where I wanted to try the minmax).
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> Everybody on this thread is advised to look closely at Min Max indexes
> before starting any further work.
>
> MinMax will give us access to many new kinds of plan, plus they are
> about as close to perfectly efficient, by which I mean almost zero
> overhead, with regard to inserts as it is possible to get.

Simon, I don't understand how minmax indexes would help in a random-inserts scenario. I would love to use minmax for other columns (since we also partition and search based on a timestamp, which is usually well clustered), but my understanding was that minmax indexes would only be a good fit for a mostly-incremental-values scenario.
Re: [HACKERS] Fast insertion indexes: why no developments
Andres Freund-3 wrote
> On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@> wrote:
>>> Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could help performance on some workloads, though I suspect the buffer
>> would have to be pretty big to make it worthwhile on a big COPY that
>> generates mostly-random insertions.
>
> Even for random data presorting the to-be-inserted data appropriately
> could result in much better io patterns.

Mmh, I'm afraid the buffer would have to be huge to get any real advantage: you have to buffer enough values to avoid "touching" entire pages, which is not that easy. The LSM-tree is much more complicated than a simple memory buffer + delayed inserts. The other index types that are supposed to help with random-insertion workloads rely on sequential insertions (at the expense of more writing and slower reads) rather than reusing the btree pattern.
Re: [HACKERS] Fast insertion indexes: why no developments
Gavin Flower-2 wrote
> How about being able to mark indexes:
> 'MEMORY ONLY' to make them not go to disk
> and
> 'PERSISTENT | TRANSIENT' to mark if they should be recreated on
> machine bootup?

I would love that. But:

1) I'd like to run some tests with a "memory drive" first, to confirm that this would actually help (I'm sure I tried it in the past, but I don't remember the outcome)
2) I don't know whether the fact that they are in memory should be handled by the db or not. I was thinking about something more like "RECREATE IF NOT FOUND", that is: if the files aren't found at postgresql startup, re-create the index...
3) I don't know how many people would be interested (and how doable/complicated it would be, considering log replay, replication, etc. etc.)
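A minimal sketch of the test in point 1, assuming a tmpfs mount at a hypothetical /mnt/ramdisk (index and table names reused from the earlier sketches):

    -- tmpfs mounted beforehand, e.g.:
    --   mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
    CREATE TABLESPACE ram_ts LOCATION '/mnt/ramdisk/pg_ts';

    CREATE INDEX cdr_calling_imsi_idx ON cdr (calling_imsi)
        TABLESPACE ram_ts;

    -- After an OS crash or reboot the files are gone, so today the
    -- index must be recreated by hand: exactly the "RECREATE IF NOT
    -- FOUND" gap described in point 2.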
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--
> [...]
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get written to disk once.

Honestly, I ran a lot of tests in the past, and I don't remember whether I tried 15-minute checkpoints + high shared_buffers. That might work. I'm going to try it and see what happens.

Jeff Janes wrote
> If the buffers get read, dirtied, and evicted from a small shared_buffers
> over and over again
> then you are almost guaranteed that will get written to disk multiple
> times

As I understand it (but I might be wrong), high shared_buffers don't help, because with such random index writes lots and lots of pages get dirtied, even if the change in each page is minimal. So, within the 15-minute period, you write the same pages over and over again: even with high shared_buffers, the same page will get synced to disk multiple times (at every checkpoint).

The idea of those "other" indexes is to avoid the random writing by maximizing sequential writes, even if that means writing more bytes. In other words: writing a full 8KB is no different from writing 20 bytes in a page, as we'll have to sync the whole page anyway...

I'll try a 15-minute checkpoint interval... and see what happens.
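The settings under discussion, sketched as a postgresql.conf fragment; the values are illustrative only and would need tuning for the actual machine:

    shared_buffers = 16GB                # enough to hold the hot index pages
    checkpoint_timeout = 15min           # matched to the partitioning period
    checkpoint_segments = 256            # 9.3-era knob; raise it so WAL volume
                                         # doesn't force early checkpoints
    checkpoint_completion_target = 0.9   # spread the checkpoint writes out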
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions and inherit the new aggregated one. This would keep the value
> of K down, at the expense of re-writing data multiple times (but all
> methods write data multiple times, some just hide it from you).

Forgot to add, I also thought that we could:

- use RAM as the tablespace for the indexes, and move the indexes later (but postgresql doesn't handle a machine crash very well in this case... it would be cool to be able to create an index as "recreate on crash"...)
- use unlogged tables and then turn them into logged ones to speed up insertion somehow; I actually started to write a patch for it, but making it work with replication was too hard (not that I'm using replication, but a patch for "wal_level = minimal" only wouldn't be accepted). And it wouldn't solve the problem anyway.
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> Are partitions read-only once time has moved on, or can stragglers show up
> that need to be inserted into older partitions?
>
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions and inherit the new aggregated one. This would keep the value
> of K down, at the expense of re-writing data multiple times (but all
> methods write data multiple times, some just hide it from you).

Yes, we could "merge" the partitions: the idea was to merge them during the night hours, when traffic is low (and the NSA people are sleeping ;) ).

Jeff Janes wrote
> By the way, what is the transaction structure of your inserts? Are they
> large batches between commits, or is each row committed?

Large batches, of course (using COPY).
Re: [HACKERS] Fast insertion indexes: why no developments
> Point being: hardware is marching along pretty fast (after 20+ years
> of stagnation) and it's dangerous (IMO) to make big software
> investments based on the situation on the ground *today*.

Yes, that's a good point.
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> The index insertions should be fast until the size of the active part of
> the indexes being inserted into exceeds shared_buffers by some amount
> (what that amount is would depend on how much dirty data the kernel is
> willing to allow in the page cache before it starts suffering anxiety
> about it). If you have enough shared_buffers to make that last for 15
> minutes, then you shouldn't have a problem inserting with live indexes.

Sooner or later you'll have to checkpoint those shared_buffers... and we are talking about GBs of data (my understanding is that we change basically every btree page, resulting in a re-write of the whole index).
Re: [HACKERS] Fast insertion indexes: why no developments
> LSM-trees seem patent free

I'm no expert, and I only gave it a look some time ago: it seemed very complicated to get right... and as far as I remember you don't get that much gain unless you go multi-level, which would complicate things further.

> Please somebody advise patent status of Y-trees otherwise I wouldn't
> bother.

Y-trees look much easier to get right... (and to me they also make more sense, but I'm not skilled enough to judge). There's also the FD-tree, which looks a lot like the (patented...) fractal tree: http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf
Re: [HACKERS] Fast insertion indexes: why no developments
> What is the reason for needing such fast access to individual groups
> of records? Sure sounds like the NSA or similar ;-)

Users need to search all calls originated from/to a user or from/to a specific mobile phone to answer/analyze customers' probl... ok, I give up: I work for the NSA ;)

> In terms of generality, do you think its worth a man year of developer
> effort to replicate what you have already achieved? Who would pay?

1) I haven't achieved what I need: realtime indexing. I can't query the "current 15 minutes" table efficiently. Plus, K*log(N) is not that great when K is large.
2) I'm not suggesting that this is top priority. I'm asking whether there's something else, other than "we don't have time for this", that I don't know. In fact, I don't even know whether those index types would really help in my (specific) case. That's why my original question was "why are there no developments in this area": I didn't mean to imply that someone should do it. I just wanted to know whether those indexes had already been discussed (and maybe dismissed for some reason) in the past...
Re: [HACKERS] Fast insertion indexes: why no developments
> Presumably the data you are inserting isn't actually random. Please
> describe the use case you are considering in more detail and some view
> on how frequent that is, with some examples. Once we understand the
> use case and agree it is important, we might solve problems.

Collecting call data for mobile network operators (and no, I don't work for the NSA...).

Easily 5,000-10,000 inserts per second. Indexes on timestamp and ID (not a problem: always increasing, so no btree issues) and on called #, calling #, IMSI, IMEI. The last four are obviously random, out of millions of possible values. After the first few million records, the disks can't keep up with the amount of random writing in the indexes.

Workaround: the table is partitioned every 15 minutes, and the indexes are created in bulk after we "start" the new 15-minute partition; see the sketch below. Searches on the current 15 minutes are not allowed (as it is not indexed), and searches on older data cost K*log(N) (where K is the number of partitions).

Yes, I could throw more disks at it, use ssd, shard more, etc. etc. But I still think that btrees just aren't fit for this kind of problem. I don't delete data, I don't update data, and there's not that much concurrency going on. I would sacrifice search speed (K*log(N) is already much slower than "regular" btree usage) for realtime insertion.

I don't think I'm the only one with a big system to be indexed by random values. In fact, I didn't want to turn this thread into a "help me with this workload" thread. I just wanted to know whether there were other known issues with these "different" indexes besides "not enough time to implement them correctly": I was afraid someone had already dismissed them as "good in theory, bad in practice"...
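A sketch of that workaround using the inheritance-based partitioning of the time; partition, table, and column names are invented:

    -- Every 15 minutes, open a new empty, index-less partition:
    CREATE TABLE cdr_20131105_0930 (
        CHECK (call_timestamp >= '2013-11-05 09:30'
           AND call_timestamp <  '2013-11-05 09:45')
    ) INHERITS (cdr);

    -- ... COPY streams into cdr_20131105_0930 for 15 minutes ...

    -- When the window closes, index the now-static partition in bulk:
    CREATE INDEX ON cdr_20131105_0930 (calling_imsi);
    CREATE INDEX ON cdr_20131105_0930 (called_imsi);
    CREATE INDEX ON cdr_20131105_0930 (calling_imei);
    CREATE INDEX ON cdr_20131105_0930 (called_imei);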
Re: [HACKERS] Fast insertion indexes: why no developments
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Reading the implementation of those features, I don't think they can help with the cases handled by the index types I mentioned (insertion of random values into big tables).
Re: [HACKERS] Fast insertion indexes: why no developments
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Ah! Didn't know that!

> Your comment about the lack of development in indexes seems counter to
> the literature that I've seen. The main problem is people keep
> patenting things, making it fairly difficult for everyone.

Mmh, maybe I wasn't clear: I meant the lack of development (maybe I should have said "implementation"?), in postgresql and in the other "sql databases", of the fast-insertion indexes you can find in the literature.
Re: [HACKERS] Fast insertion indexes: why no developments
> I bet you've mis-diagnosed the problem. Btrees don't have a problem
> keeping up with 50m records; your problem is that after a certain
> point your page cache can't keep up with the pseudo-random i/o
> patterns and you start seeing faults to storage.
> [...] This has nothing to do with the btree algorithm except to the
> extent it affects i/o patterns.

Of course; that's why those "different" index types aim to use more sequential writes than random ones.
Re: [HACKERS] Fast insertion indexes: why no developments
> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup. In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why. Again, this isn't a terribly
> encouraging precedent for implementing some other index type that's
> supposed to (sometimes) be faster than btrees.

Yes, I understand. Which is also why I was curious to know whether the "claims" those papers (and the databases using them) make are real...

Thank you everybody for your replies.

Leonardo
Re: [HACKERS] Fast insertion indexes: why no developments
> Another point to add: I don't really see btree as a barrier to
> performance for most of the problems I face. The real barriers to
> database performance are storage, contention, and query planning.

Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?) of people use/need. But if you try to insert rows into a 50M-row table with a couple of indexes, btrees just can't keep up. Of course, you can't have it all: fast insertion into big tables, low contention, good query times...

> Postgres btree indexes are pretty fast and for stuff like bulk
> insertions there are some optimization techniques available (such as
> sharding or create index concurrently).

At the moment I'm relying on partitioning + creating the indexes in bulk on the "latest" table (the partitioning is based on time). But that means K*log(N) search times (where K is the number of partitions). That's why I took a look at these different indexing mechanisms.
Re: [HACKERS] Fast insertion indexes: why no developments
> Before getting too excited about some new academic index type, it's worth
> noting the sad state in which hash indexes have languished for years.
> Nobody's bothered to add WAL support, let alone do any other real work
> on them. The non-btree index types that have been getting love are the
> ones that offer the ability to index queries that btree can't. I think
> a new index type whose only benefit is the claim to be faster in a narrow
> use-case is likely to end up like hash, not getting used enough to be
> properly maintained.
>
> regards, tom lane

Aren't hash indexes in a poor state precisely because they are not faster than btrees under every condition?
[HACKERS] Fast insertion indexes: why no developments
Hi,

I don't see much interest in insert-efficient indexes. These are the ones I've found:

- LSM-tree (used by Cassandra and SQLite4?)
- Y-Tree (http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf)
- Fractal indexes (TokuDB, patented)

While I understand that b*trees are still the best compromise in insertion/search speed, disk size, concurrency, and more generally in OLTP workloads, they are useless when it comes to inserting random (not ordered) values into big data tables (>50M rows).

I would like to know if the lack of development in this area (not only in postgresql, but in databases in general) is due to:

1) complex implementation
2) poor search performance
3) poor concurrency performance
4) not interesting for most users
5) something else???

I thought this was going to change due to the fast-insertion needs of "social applications", but TokuDB seems to be the only "successful" player in the area (and I don't know how much of that is due to good marketing). Most other DB technologies claim faster insertion speed (MongoDB and the like...), but in the end they rely on the old b*tree + sharding instead of using different indexing mechanisms (with the exception of Cassandra).

Thank you in advance

Leonardo
Re: [HACKERS] Page Checksums
> Agreed. I do agree with Heikki that it really ought to be the OS
> problem, but then we thought that about dtrace and we're still waiting
> for that or similar to be usable on all platforms (+/- 4 years).

My point is that this looks like it is going to take 1-2 years in postgresql, so it looks like a huge job... but at the same time I understand we can't just "hope other filesystems will catch up"! I guess this feature will be tunable (off/on)?
Re: [HACKERS] Page Checksums
>> I think what I meant was: isn't this going to be useless in a couple
>> of years (if, say, btrfs becomes available)? Or does it actually give
>> something that the FS will never be able to give?
>
> Yes, it will help you find/address bugs in the filesystem. These
> things are not unheard of...

It sounds to me like a huge job to fix some issues that are "not unheard of"...

My point is: if we are trying to fix misbehaving drives/controllers (something that is more common than one might think), that's already done by ZFS on Solaris and FreeBSD, and will be done by btrfs on Linux. I understand not trusting drives/controllers; but not trusting a filesystem... What am I missing? (I'm far from being an expert... I just don't understand...)
Re: [HACKERS] Page Checksums
On 21/12/2011 16.19, Stephen Frost wrote:
> * Leonardo Francalanci (m_li...@yahoo.it) wrote:
>> I can't help in this discussion, but I have a question: how different
>> would this feature be from filesystem-level CRC, such as the one
>> available in ZFS and btrfs?
>
> Depends on how much you trust the filesystem. :)

Ehm, I hope that was a joke...

I think what I meant was: isn't this going to be useless in a couple of years (if, say, btrfs becomes available)? Or does it actually give something that the FS will never be able to give?
Re: [HACKERS] Page Checksums
I can't help in this discussion, but I have a question: how different would this feature be from filesystem-level CRC, such as the one available in ZFS and btrfs?
Re: [HACKERS] WIP: Collecting statistics on CSV file data
> New API AnalyzeForeignTable

I didn't look at the patch, but I'm using CSV foreign tables over named pipes to get near-realtime KPIs calculated by postgresql. Since pipes can be read only once, I wouldn't want an "automatic analyze" of foreign tables...
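A minimal sketch of that setup with file_fdw; paths and column names are invented:

    -- shell, beforehand: mkfifo /var/run/kpi_pipe
    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_srv FOREIGN DATA WRAPPER file_fdw;

    CREATE FOREIGN TABLE kpi_feed (
        ts     timestamptz,
        metric text,
        value  numeric
    ) SERVER csv_srv
      OPTIONS (filename '/var/run/kpi_pipe', format 'csv');

    -- Each scan drains whatever the producer has written to the pipe,
    -- so an automatic ANALYZE would silently consume data meant for
    -- the reader:
    SELECT metric, avg(value) FROM kpi_feed GROUP BY metric;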
Re: [HACKERS] Is there any plan to add unsigned integer types?
> compression is an interesting topic: the guys over at tokudb are
> making some wild claims...i'm curious if they are real, and what the
> real tradeoffs are.

I don't know how much of the performance they claim comes from compression and how much from the different indexing technique they use (see my post here, where nobody answered... http://archives.postgresql.org/pgsql-general/2011-09/msg00615.php )
Re: [HACKERS] Your Postgresql 9.2 patch
> I've nearly finished editing prior to commit, so no worries.

Thank you, let me know if I can help.

Leonardo
Re: [HACKERS] Your Postgresql 9.2 patch
> Leonardo,
>
> Your patch:
>
> use less space in xl_xact_commit
>
> ... has been waiting on an updated version from you for 10 days now. Do
> you think you're likely to complete it for this CommitFest?

I sent an email on the subject:
http://postgresql.1045698.n5.nabble.com/use-less-space-in-xl-xact-commit-patch-tp4400729p4495125.html
and the thread went on and on, but now I'm confused about what to do... Can someone tell me what we want???

Leonardo
Re: [HACKERS] use less space in xl_xact_commit patch
> The important thing is that we retain backwards compatibility with
> current XLOG_XACT_COMMIT. I'm not worried what we call the other one.

Ok, let me see if I got it right:

    #define XLOG_XACT_COMMIT 0x00

should become:

    #define XLOG_XACT_COMMIT_WITH_INFO 0x00

and I'll add a

    #define XLOG_XACT_COMMIT 0x60

Then I'll leave 2 structs, "xl_xact_commit_with_info" and "xl_xact_commit".

Leonardo
Re: [HACKERS] use less space in xl_xact_commit patch
> With regards to the naming, I think it would be better if we kept the
> XLOG_XACT_COMMIT record exactly as it is now, and make the second
> record an entirely new record called XLOG_XACT_COMMIT_FASTPATH. That
> way we retain backwards compatibility.
>
> If you'd like to rework like that please, otherwise I can take it from
> here if you'd like.

I think I did it; while doing it, I believe I found a bug: I hadn't updated "recoveryStopsHere". Please double-check that, as I really don't know what I'm doing there... Should I also change the struct name from xl_xact_commit to xl_xact_commit_fast_path?

>> How can I test it with "weird" stuff such as subtransactions, shared
>> cache invalidation messages...?
>
> make installcheck should cover those.

Ok, all tests passed.

[Attachment: commitlog_lessbytes_v3.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
>> On Wed, May 25, 2011 at 3:05 PM, Simon Riggs wrote:
>
> Leonardo, can you submit an updated version of this patch today that
> incorporates Simon's suggestion?

Mmmh, maybe it was simpler than I thought; I must be missing something... patch attached.

How can I test it with "weird" stuff such as subtransactions, shared cache invalidation messages...?

Leonardo

[Attachment: commitlog_lessbytes_v2.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
> On Tue, Jun 14, 2011 at 2:31 PM, Simon Riggs wrote:
>
> Well, we certainly have the option to review and commit the patch any
> time up until feature freeze. However, I don't want the CommitFest
> application to be full of entries for patches that are not actually
> being worked on, because it makes it hard for reviewers to figure out
> which patches are in a state where they can be usefully looked at. AIUI,
> this one currently is not, because it was reviewed three weeks ago and
> hasn't been updated.

Yes, it's true: I thought I could find the time to work on it, but I didn't. Let me know the deadline for it, and I'll see if I can make it (otherwise I'll do it in the next commit fest).

Leonardo
Re: [HACKERS] use less space in xl_xact_commit patch
> We don't need to be in a hurry here. As the reviewer I'm happy to give
> Leonardo some time, obviously no more than the end of the commit fest.
>
> If he doesn't respond at all, I'll do it, but I'd like to give him the
> chance and the experience if possible.

Sorry I couldn't update the patch (in fact, it's more of a total rewrite than an update). How much time do I have?

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> Well, I sort of assumed the design was OK, too, but the more we talk
> about this WAL-logging stuff, the less convinced I am that I really
> understand the problem. :-(

I see. In fact, I think nobody thought about restart points... To sum up:

1) Everything seems ok in the wal_level = minimal case. In this case, we fsync everything, and at transaction commit we remove the init fork; in case of a crash, we don't replay anything (as nothing was written to the log), and we remove the main fork as we do now.

2) In the wal_level != minimal case, things become more complicated: if the standby reaches a restart point and then crashes, we are in trouble: it would remove the main fork at startup, and would replay only a portion of the table. I guess the same applies to the master too: if we log HEAP_XLOG_NEWPAGEs, reach a checkpoint, and then crash, at server restart the main fork would be deleted, and the pages logged in the WAL couldn't be replayed. But the problem on the master could be removed by using another record type instead of HEAP_XLOG_NEWPAGE (one to be replayed by the standbys only).

Is this analysis correct?

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> I think
> we need a detailed design document for how this is all going to work.
> We need to not only handle the master properly but also handle the
> slave properly. Consider, for example, the case where the slave
> begins to replay the transaction, reaches a restartpoint after
> replaying some of the new pages, and then crashes. If the subsequent
> restart from the restartpoint blows away the main relation fork, we're
> hosed. I fear we're plunging into implementation details without
> having a good overall design in mind first.

As I said in my first post, I'm basing the patch on this post:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
So I assumed the design was ok (except for the "stray file around on a standby" case, which was discussed earlier in this thread). If there are things to be discussed/analyzed (I guess the restart point issue is one of them), we can do it... but I thought the overall design was somehow in place.

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> Why is it necessary to replay the operation only on the slave? Can we
> just use XLOG_HEAP_NEWPAGE?

Uh, I don't know why, but I thought I shouldn't log the pages on the master, since all the pages are already there and fsync-ed. But if it does no harm, I can easily use XLOG_HEAP_NEWPAGE (of course, only in the wal_level != minimal case).
Re: [HACKERS] switch UNLOGGED to LOGGED
> From: Noah Misch
>> - the patch is missing the "send all table pages to the
>> standby" part; is there some code I can use as base?
>
> Nothing comes to mind as especially similar.
>
>> I guess I have to generate some special log type that
>> is only "played" by standby servers.
>
> What you described in your followup mail seemed reasonable.

So, it's ok to have a log item that is replayed only if WalRcvInProgress() is true? Is that a correct approach? I couldn't find any other way to find out whether we are on a standby or a master...

>> - on the standby, the commit part should be played as it
>> is on the master (that is, removing the INIT fork).
>> The abort case is different though: it would mean
>> doing nothing on the master, while removing every fork
>> but the INIT fork on the standby.
>> Would it be ok to add to xl_xact_abort a new array of
>> RelFileNode(s), where for each one at abort all the forks,
>> except the init fork, have to be deleted by the standby
>> (while the master shouldn't do anything with them)?
>> I bet there's a cleaner solution...
>
> Your "use less space in xl_xact_commit patch" seems to be going in a good
> direction here. It would probably also be okay to do a
> ResetUnloggedRelations() on the standby at every abort of a transaction
> that had started an UNLOGGED -> LOGGED conversion. That is, just a flag
> might be enough.

Ok, but that would mean that a transaction aborting a conversion would try to reset all unlogged relations (traversing the whole data directory)... I don't know if that's acceptable performance-wise.
Re: [HACKERS] use less space in xl_xact_commit patch
Sorry, email sent without body. Fixed some English mistakes.

[Attachment: commitlog_lessbytes02.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
[Attachment: commitlog_lessbytes02.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
> Da: Simon Riggs
> I can't find a clear discussion of what you are trying to do, and how,
> just a URL back to a complex discussion on another topic.

While trying to write a patch to allow changing an unlogged table into a logged one, I had to add another int field to xl_xact_commit. Robert Haas said:

"I have to admit I don't like this approach very much. I can't see adding 4 bytes to every commit record for this feature."

which is a correct remark. xl_xact_commit can contain some arrays (relations to drop, committed sub-transactions, shared invalidation msgs). The lengths of these arrays are specified by 3 ints in the struct. So, to avoid adding more ints to the struct, I was advised to remove all the ints and use xl_xact_commit.xinfo to flag which arrays are, in fact, present. The whole idea is:

- remove nrels, nsubxacts and nmsgs from xl_xact_commit
- use bits in xinfo to signal which arrays are present at the end of xl_xact_commit
- for each present array, add the length of the array (as an int) at the end of xl_xact_commit
- add each present array after all the lengths

> I can't see any analysis that shows whether this would be effective in
> reducing space of WAL records and what the impacts might be.

The size of the WAL records should always be <= what it was without the patch: in the worst-case scenario, all the ints are present (as they would be without the patch); in the best case, which I guess is the most common, each xl_xact_commit is 3 ints shorter. I don't think the effect will be "perceivable": the whole idea is to allow more variable-length structures in xl_xact_commit in the future, without increasing the space on disk.

> The patch contains very close to zero comments.

I tried to add some.

Leonardo

[Attachment: commitlog_lessbytes01.patch]
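A sketch of the record layout described in the list above. The flag and field names are illustrative, not the actual patch code; the types come from the PostgreSQL headers utils/timestamp.h, storage/relfilenode.h, and storage/sinval.h:

    /*
     * Instead of always storing the three array-length counters,
     * xinfo bits say which arrays follow, and only the lengths of
     * the arrays actually present are written after the fixed part.
     */
    #define XACT_COMMIT_HAS_RELFILENODES  0x01
    #define XACT_COMMIT_HAS_SUBXACTS      0x02
    #define XACT_COMMIT_HAS_INVALMSGS     0x04

    typedef struct xl_xact_commit
    {
        TimestampTz xact_time;  /* time of commit */
        uint32      xinfo;      /* flag bits: which arrays follow */
        /*
         * Then, for each bit set in xinfo, in a fixed order, an int
         * array length; after all the lengths, the arrays themselves:
         *   RelFileNode   xnodes[];   -- if ..._HAS_RELFILENODES
         *   TransactionId subxacts[]; -- if ..._HAS_SUBXACTS
         *   SharedInvalidationMessage msgs[]; -- if ..._HAS_INVALMSGS
         */
    } xl_xact_commit;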
Re: [HACKERS] switch UNLOGGED to LOGGED
> - the patch is missing the "send all table pages to the
> standby" part; is there some code I can use as base?
> I guess I have to generate some special log type that
> is only "played" by standby servers.

Maybe I could use log_newpage, but instead of XLOG_HEAP_NEWPAGE I could emit something like XLOG_HEAP_COPYPAGE; then, in heap_redo, in the XLOG_HEAP_COPYPAGE case, call heap_xlog_newpage only if we're on a standby...
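Something like this, that is (XLOG_HEAP_COPYPAGE is the invented record type, and WalRcvInProgress() is just one possible way to test for a standby; sketch only):

        /* Hypothetical new case in heap_redo: a full-page copy that
         * only standby servers should apply. */
        case XLOG_HEAP_COPYPAGE:
            if (WalRcvInProgress())     /* replay only on a standby */
                heap_xlog_newpage(lsn, record);
            break;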
Re: [HACKERS] switch UNLOGGED to LOGGED
I'll try to sum up what I understood:

1) the standby keeps the lock, so there is no problem with stray files coming from the unlogged->logged log replay, as the table can't be read during the operation
2) calling ResetUnloggedRelations before ProcArrayApplyRecoveryInfo would remove the problem of stray files on the standby in case the master crashes before commit/abort
3) promoting the standby shouldn't be an issue, since ResetUnloggedRelations is already called in ShutdownRecoveryTransactionEnvironment

Now, to move forward, some questions:

- the patch is missing the "send all table pages to the standby" part; is there some code I can use as a base? I guess I have to generate some special log type that is only "played" by standby servers.
- on the standby, the commit part should be played as it is on the master (that is, removing the INIT fork). The abort case is different though: it would mean doing nothing on the master, while removing every fork but the INIT fork on the standby. Would it be OK to add to xl_xact_abort a new array of RelFileNode(s), where for each one, at abort, all the forks except the init fork have to be deleted by the standby (while the master shouldn't do anything with them)? I bet there's a cleaner solution... (a rough sketch of the idea is below)

Leonardo
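P.S. Just to make the xl_xact_abort question concrete, a rough sketch (the nstandbyrels field is invented, not a worked-out design):

    typedef struct xl_xact_abort
    {
        TimestampTz xact_time;     /* time of abort */
        int         nrels;         /* existing: files to drop at abort */
        int         nsubxacts;     /* existing: aborted subxids */
        int         nstandbyrels;  /* new: relations to strip down to the
                                    * INIT fork on the standby; the master
                                    * ignores these entries */
        /* the RelFileNode, TransactionId and RelFileNode arrays follow */
    } xl_xact_abort;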
Re: [HACKERS] switch UNLOGGED to LOGGED
> I'd guess some WAL record arising from the post-crash master restart makes the
> standby do so. When a crash isn't involved, the commit or abort record is that
> signal. You could test and find out how it happens after a master crash with a
> procedure like this:
>
> 1. Start a master and standby on the same machine.
> 2. Connect to master; CREATE TABLE t(); BEGIN; ALTER TABLE t ADD c int;
> 3. kill -9 -`head -n1 $master_PGDATA/postmaster.pid`
> 4. Connect to standby and confirm that t is still locked.
> 5. Attach debugger to standby startup process and set breakpoints on
> StandbyReleaseLocks and StandbyReleaseLocksMany.
> 6. Restart master.

Well yes, based on the test the call stack is something like:

StandbyReleaseLocksMany
StandbyReleaseOldLocks
ProcArrayApplyRecoveryInfo
xlog_redo

It's not very clear to me what ProcArrayApplyRecoveryInfo does (I'm not too familiar with the standby part, I guess), but I see it's called by xlog_redo in the "info == XLOG_CHECKPOINT_SHUTDOWN" case and by StartupXLOG. But I don't know if calling ResetUnloggedRelations before the call to ProcArrayApplyRecoveryInfo in xlog_redo makes sense... if it does, it would solve the problem of the stray files in the master-crash case, I guess?

> When you promote the standby, though, ShutdownRecoveryTransactionEnvironment()
> releases the locks.

If I understand the code, ResetUnloggedRelations is called before ShutdownRecoveryTransactionEnvironment, so that part shouldn't be an issue.

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:
> > > By the time the startup process
> > > releases the AccessExclusiveLock acquired by the proposed
> > > UNLOGGED -> normal conversion process, that relfilenode
> > > needs to be either fully copied or unlinked all over again.
> > > (Alternately, find some other way to make sure queries don't
> > > read the half-copied file.)
> >
> > About this issue: how are AccessExclusiveLocks released on
> > the standby when the master crashes?
>
> I assume those locks remain. It wouldn't be safe to release them; a master
> crash is just one kind of WAL receipt latency.

But somehow, when the master restarts, the standby gets notified that it has to unlock???

> When you promote the standby, though, ShutdownRecoveryTransactionEnvironment()
> releases the locks.

OK; then the problem in the UNLOGGED -> normal conversion is this:

1) the master and the standby acquire a lock on the table
2) the master starts sending the pages to the standby
3) the master crashes before committing

Up to here, no problems, as the standby still holds the lock on the table.

4) when the master restarts, it removes all the forks for rels that have INIT forks; are those "deletes" sent to the standby? And, if yes, would they be replayed by the standby *before* releasing the lock?

If the answer is "yes", then I don't think we have a problem... but I think that at the moment those unlogged-table-fork deletes aren't sent at all.

(When promoting the standby, we could have ShutdownRecoveryTransactionEnvironment() remove all the forks for rels with INIT forks before releasing the locks.)

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> By the time the startup process
> releases the AccessExclusiveLock acquired by the proposed
> UNLOGGED -> normal conversion process, that relfilenode
> needs to be either fully copied or unlinked all over again.
> (Alternately, find some other way to make sure queries don't
> read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on the standby when the master crashes?
Re: [HACKERS] use less space in xl_xact_commit patch
This is a second version, now using

    int counts[1];    /* variable-length array of counts */

in xl_xact_commit to keep track of the lengths of the different arrays at the end of the struct. Waiting for feedback...

Leonardo

commitlog_lessbytes00.patch
Description: Binary data
Re: [HACKERS] use less space in xl_xact_commit patch
> int counts[1]; /* variable-length array of counts, xinfo flags define
> length of array and meaning of counts */

Damn, that's much cleaner than what I did. I don't know why I stuck with the idea that it had to be:

    int, array, int, array, ...

instead of:

    int, int, ..., array, array, ...

which makes much more sense.

> Then, I'd make macros like this:
>
> #define XactCommitNumberOfDroppedRelFileNodes(xlrec) \
>     ((xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ? xlrec->counts[0] : 0)
> #define XactCommitNumberOfCommittedSubXids(xlrec) \
>     ((xlrec->xinfo & XACT_COMMITTED_SUBXIDS) ? \
>      xlrec->counts[(xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ? 1 : 0] : 0)
> ...etc...

Ehm, I don't know if macros would be enough; that's OK for the first two, but then I think it would become a mess... Maybe I'll use a simple function that gets all the "ints" at once (something like the sketch below).

> ...and a similar set of macros that will return a pointer to the
> beginning of the corresponding array, if it's present. I'd lay out
> the record like this:
>
> - main record
> - array of counts (might be zero-length)
> - array of dropped relfilenodes (if any)
> - array of committed subxids (if any)
> - array of sinval messages (if any)

OK

> Also, it's important not to confuse xact completion with xact commit,
> as I think some of your naming does. Completion could perhaps be
> thought to include abort.

Mmh... I don't know if I got that... but I'll take a look, and ask questions...

Thank you very much for the help

Leonardo
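P.S. This is what I mean by getting all the "ints" at once (a sketch; the flag names are the hypothetical ones from the quoted macros, plus one invented for the sinval messages):

    /*
     * Decode all the array lengths in one pass, in the fixed order in
     * which the arrays appear after the main record.
     */
    static void
    XactCommitGetCounts(xl_xact_commit *xlrec,
                        int *nrels, int *nsubxacts, int *nmsgs)
    {
        int pos = 0;

        *nrels = (xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ?
            xlrec->counts[pos++] : 0;
        *nsubxacts = (xlrec->xinfo & XACT_COMMITTED_SUBXIDS) ?
            xlrec->counts[pos++] : 0;
        *nmsgs = (xlrec->xinfo & XACT_COMMIT_INVAL_MSGS) ?
            xlrec->counts[pos++] : 0;
    }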
[HACKERS] use less space in xl_xact_commit patch
Hi, following the conversation at

http://postgresql.1045698.n5.nabble.com/switch-UNLOGGED-to-LOGGED-tp4290461p4382333.html

I tried to remove some bytes from xl_xact_commit. The way I did it needs a palloc+memcpy; I guess it could be done reusing the memory from smgrGetPendingDeletes, but I don't think that's that important. I guess there are other ways of doing it; let me know what you think.

Leonardo

commitlog_firstv.patch
Description: Binary data
Re: [HACKERS] switch UNLOGGED to LOGGED
> I don't think making xinfo shorter will save anything, because
> whatever follows it is going to be a 4-byte quantity and therefore
> 4-byte aligned.

Oops, I didn't notice that. I'll split xinfo into:

    uint16 xinfo;
    uint16 presentFlags;

I guess that helps with the reading? I mean, instead of having a single uint32?
Re: [HACKERS] switch UNLOGGED to LOGGED
> Yes, that seems like a very appealing approach. There is plenty of
> bit-space available in xinfo, and we could reserve a bit each for
> nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
> that item is present and clear meaning that the count is omitted from
> the structure (and zero). This will probably require a bit of tricky
> code reorganization so I think it should be done separately from the
> main patch.

OK, I'll try and send a patch with this change only. BTW, xinfo is 32 bits long, but I think only 2 bits are used right now? I think I can make it 8 bits, and add another 8 bits for nrels, nsubxacts, nmsgs and the new thing. That would save another 2 bytes, while leaving space for extension. Or we can make it 8 bits only, but then only 2 bits would be left "empty" for future extensions; I don't know if we care about that...

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci wrote:
> >> Maybe you should change xl_xact_commit to have a separate list of rels to
> >> drop the init fork for (instead of mixing those with the list of files to
> >> drop as a whole).
> >
> > I tried to follow your suggestion, thank you very much.
>
> I have to admit I don't like this approach very much. I can't see
> adding 4 bytes to every commit record for this feature.

I understand. What if, in xl_xact_commit, instead of "RelFileNode xnodes" I used another struct for xnodes, something like:

    {
        RelFileNode xnode;
        bool        onlyInitFork;
    }

That would increase the commit record size only when there are RelFileNode(s) to drop at commit. So, instead of 4 bytes in every commit, there are "wasted" bytes only when the commit record contains deleted permanent relations (which should happen much less often). I'm open to suggestions here...

> > 3) Should we have a "cascade" option? I don't know if I have to handle
> > inherited tables and other dependent objects
>
> Look at the way ALTER TABLE [ONLY] works for other action types, and copy it.

OK

Thank you very much

Leonardo
Re: [HACKERS] Unlogged tables, persistent kind
> > If that 1% is random (not time/transaction related), usually you'd rather
> > have an empty table.
>
> Why do you think it would be random? "Heap blocks would be zeroed if they
> were found to be damaged, following a crash."

If you erase full blocks, you have no idea what data you erased; it could be something changed 1 hour ago, 1 month ago, or 1 year ago. This is very different from, say, synchronous_commit=off: in that case, "the most recent transactions may be lost if the database should crash". In your case, "some (who knows which???) data is lost". So, to me that sounds like random loss. I don't think that is different from a corrupted table. You're not deleting rows recently changed; you're deleting everything that is "physically close" to them.

> > In other words: is a table that is not consistent with anything else in
> > the db useful?
>
> That's too big a leap. Why would it suddenly be inconsistent with the
> rest of the database?

If you delete some data, and you have no idea what data you lost, I don't think you have a consistent db. Unless, of course, your table has no relation with any other table in the db.

Of course, all these thoughts are based on the assumption that I know what happens when a block is erased; but my knowledge of postgresql internals is not that good, so I might be *very* wrong.
Re: [HACKERS] Unlogged tables, persistent kind
> The amount of data loss on a big
> table will be <1% of the data loss
> caused by truncating the whole table.

If that 1% is random (not time/transaction related), usually you'd rather have an empty table. In other words: is a table that is not consistent with anything else in the db useful?
Re: [HACKERS] Unlogged tables, persistent kind
The only data we can't rebuild is the heap. So what about an option for UNlogged indexes on a LOGged table? It would always preserve data, and it would 'only' cost rebuilding the indexes in case of an unclean shutdown. I think it would give a boost in performance for all those cases where the I/O (especially random I/O) is caused by the indexes, and it doesn't look too complicated (but maybe I'm missing something).

I proposed the unlogged-to-logged patch (BTW, has anyone given a look at it?) because we partition data based on a timestamp, and we can risk losing the last N minutes of data, but after N minutes we want to know the data will always be there, so we would like to set a partition table to 'logged'.

Leonardo
Re: [HACKERS] What Index Access Method Functions are really needed?
> > another question regarding indexes. Sadly I can't find enough info in
> > the documentation. Which of the functions are needed in order for an
> > index to work?
>
> All of them.

Maybe I completely misunderstood the question, but some functions are optional, such as amgetbitmap, right?

http://www.postgresql.org/docs/9.0/static/index-functions.html
Re: [HACKERS] switch UNLOGGED to LOGGED
> Maybe you should change xl_xact_commit to have a separate list of rels to
> drop the init fork for (instead of mixing those with the list of files to
> drop as a whole).

I tried to follow your suggestion, thank you very much. Here's a first attempt at the patch. I "tested" it with:

create table forei (v integer primary key);
insert into forei select * from generate_series(1,1);
create unlogged table pun (c integer primary key,
    constraint con foreign key (c) references forei(v));
insert into pun select * from generate_series(1,1);
alter table pun set logged;

then shut down the master with "immediate":

bin/pg_ctl -D data -m immediate stop
bin/pg_ctl -D data start

and "pun" still has data:

select * from pun where c=100;

Questions/comments:

1) it's a very-first-stage patch; I would need to know if something is *very* wrong before cleaning it up.
2) there are some things I implemented using a logic like "let's see how it worked 10 lines above, and I'll do the same". For example, the 2PC stuff is totally "copied" from the other places; I have no idea if the code makes sense at all (how can I test it?)
3) Should we have a "cascade" option? I don't know if I have to handle inherited tables and other dependent objects
4) During the check for dependency problems, I stop as soon as I find an error; would that be enough?

Leonardo

unl2log_20110422.patch
Description: Binary data
Re: [HACKERS] switch UNLOGGED to LOGGED
I think I coded a very basic version of the UNLOGGED to LOGGED patch (only the wal_level=minimal case for the moment).

To remove the INIT fork, I changed PendingRelDelete somehow to have a flag "bool onlyInitFork", so that the delete would remove only the INIT fork at commit. Everything "works" (note the quotes...) except in the case of a not-clean shutdown ("-m immediate" to pg_ctl stop). The reason is that the replay code doesn't have any idea that it has to remove only the INIT fork: it will remove ALL forks; so at the end of the redo procedure (at startup, after a "pg_ctl -m immediate stop") the table doesn't have any forks at all. See xact_redo_commit:

/* Make sure files supposed to be dropped are dropped */
for (i = 0; i < xlrec->nrels; i++)
{
    [...]
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        if (smgrexists(srel, fork))
        {
            XLogDropRelation(xlrec->xnodes[i], fork);
            smgrdounlink(srel, fork, true);
        }
    }
    smgrclose(srel);
}
[...]

Should I change xl_xact_commit in order to have, instead of:

/* Array of RelFileNode(s) to drop at commit */
RelFileNode xnodes[1];    /* VARIABLE LENGTH ARRAY */

an array of structures such as:

{
    RelFileNode relfilenode;
    bool        onlyInitFork;
}

??? Otherwise I don't know how to tell the redo commit code to delete only the INIT fork, instead of all the relation forks... (maybe I'm doing it all wrong: I'm open to any kind of suggestion here...) (A sketch of the modified redo loop is below.)

Leonardo
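P.S. With such a struct, the redo loop might look roughly like this (sketch only; it mirrors the excerpt above, with srel opened as in the original code):

/*
 * Sketch: assumes xl_xact_commit carries an array of
 * { RelFileNode relfilenode; bool onlyInitFork; } entries.
 */
for (i = 0; i < xlrec->nrels; i++)
{
    [...]   /* open srel as in the original code */
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        /* when the flag is set, drop ONLY the INIT fork */
        if (xlrec->xnodes[i].onlyInitFork && fork != INIT_FORKNUM)
            continue;

        if (smgrexists(srel, fork))
        {
            XLogDropRelation(xlrec->xnodes[i].relfilenode, fork);
            smgrdounlink(srel, fork, true);
        }
    }
    smgrclose(srel);
}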
Re: [HACKERS] switch UNLOGGED to LOGGED
> > If the master crashes while a transaction that used CREATE TABLE is unfinished,
> > both the master and the standby will indefinitely retain identical, stray (not
> > referenced by pg_class) files. The catalogs do reference the relfilenode of
> > each unlogged relation; currently, that relfilenode never exists on a standby
> > while that standby is accepting connections. By the time the startup process
> > releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
> > conversion process, that relfilenode needs to be either fully copied or unlinked
> > all over again. (Alternately, find some other way to make sure queries don't
> > read the half-copied file.) In effect, the problem is that the relfilenode is
> > *not* stray, so its final state does need to be well-defined.
>
> Oh, right.
>
> Maybe we should just put in a rule that a server in Hot Standby mode
> won't ever try to read from an unlogged table (right now we count on
> the fact that there will be nothing to read). If we crash before
> copying the whole file, it won't matter, because the catalogs won't
> have been updated, so we'll refuse to look at it anyway. And we have
> to reinitialize on entering normal running anyway, so we can clean it
> up then.

OK then... I'll try to code the "easy" version first (the wal_level=minimal case) and then we'll see.

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> > But re-reading it, I don't understand: what's the difference in creating
> > a new "regular" table and crashing before emitting the abort record,
> > and converting an unlogged table to logged and crashing before
> > emitting the abort record? How do the standby servers handle a
> > "CREATE TABLE" followed by a ROLLBACK if the master crashes
> > before writing the abort record? I thought that too would "leave a
> > stray file around on a standby".
>
> I've been thinking about the same thing. And AFAICS, your analysis is
> correct, though there may be some angle to it I'm not seeing. Anyone else?

I would like to know if what I'm trying to do is, in fact, possible... otherwise starting with the wal_level=minimal case first will be wasted effort in case the other cases can't be integrated somehow...

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> I'm pretty sure we wouldn't accept a patch for a feature that would
> only work with wal_level=minimal, but it might be a useful starting
> point for someone else to keep hacking on.

I understand. Reading your post at

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

I thought I got the part:

"what happens if we *crash* without writing an abort record? It seems like that could leave a stray file around on a standby, because the current code only cleans things up on the standby at the start of recovery"

But re-reading it, I don't understand: what's the difference between creating a new "regular" table and crashing before emitting the abort record, and converting an unlogged table to logged and crashing before emitting the abort record? How do the standby servers handle a "CREATE TABLE" followed by a ROLLBACK if the master crashes before writing the abort record? I thought that too would "leave a stray file around on a standby".

Leonardo
[HACKERS] switch UNLOGGED to LOGGED
Hi, I read the discussion at

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

From what I can understand, going from/to unlogged to/from logged in the wal_level == minimal case is not too complicated. Suppose I try to write a patch that allows

ALTER TABLE tablename SET LOGGED (or UNLOGGED)

(proper SQL wording to be discussed...) only in the wal_level == minimal case: would it be accepted as a "first step"? Or rejected because it doesn't allow it in the other cases? From what I got in the discussion, handling the other wal_level cases can be very complicated (for example, the issues in case "we *crash* without writing an abort record").

Leonardo
Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?
> Here's my single biggest tip for newcomers to the Postgres source:
> if you don't use ctags, glimpse, or some other tool that can quickly
> show you all references to a given identifier, go out and get one.
> It's one of the easiest ways to learn about things.

I guess Eclipse is the best tool for the newcomer, as it gives you all the references very easily and shows the declaration of functions/structs when you hover over one with the mouse:

http://wiki.postgresql.org/wiki/Working_with_Eclipse
Re: [HACKERS] plan time of MASSIVE partitioning ...
> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
> per child table, only one of which was relevant to the query. In your
> test case there are no irrelevant indexes, which is why the runtime
> didn't change.

Mmh... I must be doing something wrong. It looks to me like it's not just the irrelevant indexes: it's the "order by" that counts. If I remove that, times are the same with and without the patch. Using the test case:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with the patch; but adding:

order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);
DO $$DECLARE i int;
BEGIN
FOR i IN 0..2000 LOOP
    EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
        ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
    EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
    EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
    EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
    EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add

order by b

the non-patched version runs in 10 seconds. Am I getting it wrong?
Re: [HACKERS] plan time of MASSIVE partitioning ...
> Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
> like this:

Well, the 9000-table test case was meant to check the difference in performance with/without the patch... I don't see the reason for trying to optimize such an unrealistic case. BTW, can someone explain to me which are the cases where the patch actually helps?
Re: [HACKERS] plan time of MASSIVE partitioning ...
> This is going to be dominated by constraint exclusion checking. There's
> basically no fix for that except a more explicit representation of the
> partitioning rules.

Damn, I knew that was going to be more complicated :) So in which case does this patch help? I guess in a multi-index scenario? childtables.sql is kind of hard to read (I think a FOR loop would have been more self-explanatory?).
Re: [HACKERS] plan time of MASSIVE partitioning ...
> but I don't get any gain from the patch... explain time is still around 250 ms.
> Tried with 9000 partitions, time is still 2 secs.

Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...) and got 250 ms with both versions; with 9000 partitions, 2 secs (again, no gain from the patch).
Re: [HACKERS] plan time of MASSIVE partitioning ...
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

Hi, I'm really interested in this patch. I tried a simple test case:

create table t (a integer, b text);
DO $$DECLARE i int;
BEGIN
FOR i IN 0..9000 LOOP
    EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
        ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
    EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around 250 ms. Tried with 9000 partitions, time is still 2 secs. Maybe I've missed the patch's purpose completely? (I tried the test case at http://archives.postgresql.org/message-id/4cbd9ddc.4040...@cybertec.at and that, in fact, gets a boost with this patch.)

Leonardo
Re: [HACKERS] Custom aggregation function that creates an array
> Before you start coding, have you looked over what's in
> contrib/intarray ?

There's nothing that fulfills my needs there, but I guess it would be the perfect place to look for code examples! Thank you. (I think I made my custom aggregation function work, but I'll look into the intarray code to see if I made mistakes...)
[HACKERS] Custom aggregation function that creates an array
Hi, I want to write a custom aggregate function that, given an "int4 index", increments the element at "index" of an array and, at the end, returns the array. The array will always be int4[]. I need it in C, since plpgsql is way slower (and I need to use it on 5M+ rows).

I did it, but I also need to create the array "on demand", not as initcond (the array size will be a parameter). I suppose that, since I'll be using "construct_array" (if that makes sense), I'll need a new MemoryContext? I've looked at array_agg_transfn, but I think that's a lot more complicated because it repallocs, so it has to keep the memory context... Since I'm not familiar with these things, would something like this work?

if (PG_ARGISNULL(0))
{
    // first time: alloc array
    arr_context = AllocSetContextCreate(rcontext,
                                        "myCustomArrayResult",
                                        ALLOCSET_DEFAULT_MINSIZE,
                                        ALLOCSET_DEFAULT_INITSIZE,
                                        ALLOCSET_DEFAULT_MAXSIZE);
    oldcontext = MemoryContextSwitchTo(arr_context);
    transarray = construct_array([...]);
    MemoryContextSwitchTo(oldcontext);
}
else
{
    transarray = PG_GETARG_ARRAYTYPE_P(0);
}

This is how the C function is written right now:

ArrayType *transarray;
int32      index = PG_GETARG_INT32(1);
int32     *array;

if (AggCheckCallContext(fcinfo, NULL))
{
    // code above would go in place of this line:
    transarray = PG_GETARG_ARRAYTYPE_P(0);
}
else
    transarray = PG_GETARG_ARRAYTYPE_P_COPY(0);

array = (int32 *) ARR_DATA_PTR(transarray);
array[index]++;
PG_RETURN_ARRAYTYPE_P(transarray);

(checks on boundaries etc. omitted)

The problem is that it expects the array to be already available, so I had to do:

CREATE AGGREGATE cust_agg_func (int4) (
    sfunc = cust_agg_funct_add,
    stype = int4[],
    initcond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

I would like to have:

CREATE AGGREGATE cust_agg_func (int4 index, int4 array_size_to_be_created) (
    sfunc = cust_agg_funct_with_array_creation_add,
    stype = int4[]
);
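In other words, I imagine something like this (an untested sketch: it uses the aggregate memory context that AggCheckCallContext can return in its second argument, instead of creating a new context):

Datum
cust_agg_funct_with_array_creation_add(PG_FUNCTION_ARGS)
{
    MemoryContext aggcontext;
    ArrayType  *transarray;
    int32       index = PG_GETARG_INT32(1);
    int32       size  = PG_GETARG_INT32(2);
    int32      *array;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "must be called in an aggregate context");

    if (PG_ARGISNULL(0))
    {
        /* First call: build a zeroed int4[size] array in the aggregate's
         * context, so it survives between transition calls. */
        MemoryContext oldcontext = MemoryContextSwitchTo(aggcontext);
        Datum  *zeros = (Datum *) palloc0(size * sizeof(Datum));

        transarray = construct_array(zeros, size, INT4OID,
                                     sizeof(int32), true, 'i');
        MemoryContextSwitchTo(oldcontext);
    }
    else
        transarray = PG_GETARG_ARRAYTYPE_P(0);

    /* increment the requested slot in place */
    array = (int32 *) ARR_DATA_PTR(transarray);
    if (index >= 0 && index < size)
        array[index]++;

    PG_RETURN_ARRAYTYPE_P(transarray);
}

The aggregate would then be declared without initcond, as in the second CREATE AGGREGATE above. But again, this is just my guess at how it could work.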
Re: [HACKERS] On the usefulness of hint bits
> I wonder if we could improve this with some sort of process-local
> cache - not to get rid of hint bits, just to reduce pg_clog
> contention. We might easily end up testing the same XID many times
> during the same table scan.

I guess that's my scenario... not that many transactions, so even "copying" the whole pg_clog table into per-process memory would be doable...
Re: [HACKERS] On the usefulness of hint bits
> Reduction of contention for pg_clog access, for one thing. If you read
> the archives, you'll find that pg_clog access contention has been shown
> to be one cause of "context swap storms". Having to go to clog for
> every single tuple access would make that orders of magnitude worse.

OK; is that the "Wierd context-switching issue on Xeon" thread? Or does that have nothing to do with it? I searched for "context swap storms pg_clog" but I didn't get anything...

> More generally, we're not going to give up hint bits even if there are
> identifiable workloads where they don't buy much --- because there are
> many others where they do.

Sure: I wasn't suggesting giving them up, just making their usage user-selectable (and "on" by default).
[HACKERS] On the usefulness of hint bits
Hi, I was wondering what the advantage of hint bits is for OLAP-style workloads, that is, when the number of transactions is not that high.

If I got it right, in 10 pg_clog pages we can store the status of more than 320,000 transactions (each status takes 2 bits, so an 8 KB page holds 8192 x 4 = 32,768 statuses). That's a lot, in a very small space (80 KB). So I was wondering what gain we get from hint bits in cases where pg_clog is "small" (that is, it will be cached by postgresql/the OS).

Does somebody have some numbers on the effect of hint bits on first/second reads? I mean:

create table mytable as ...
select * from mytable   -> this one will update the hint bits
select * from mytable   -> this one will use them

To test it, I guess we should change the code to have:

- a version where hint bits are never updated (that is, always look at pg_clog), so that you don't have to write them the first time, and you'll get the "true" reading time + pg_clog reading time
- a version that always sets them to COMMITTED, so that you don't have to write them the first time, and you'll get the "true" reading time for the "second" read that would use them, regardless of any pg/OS cache

I'm asking because I don't like having all those writes on the first scan... and I would like to know what the real benefit is for the reads that come after the first one in case there are "few" transactions per second (for example, less than 1 transaction per second).
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> Applied with some significant editorialization. The biggest problem I
> found was that the code for expression indexes didn't really work, and
> would leak memory like there's no tomorrow even when it did work.

Sorry I couldn't write it the way it was supposed to be... I'll look at the differences to see what I did wrong (so maybe next time I'll do better!).

Thank you

Leonardo
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> It sounds like the costing model might need a bit more work before we
> commit this.

I tried again the simple SQL tests I posted a while ago, and I still get the same ratios. I tested the applied patch on a dual-Opteron + disk-array Solaris machine. I really don't get how a laptop hard drive can be faster at reading data using random seeks (required by the original cluster method) than seq scan + sort for the 5M-row test case. Same thing for the "cluster vs bloat" test: the seq scan + sort is faster on my machine.

I've just noticed that Josh used shared_buffers = 16MB for the "cluster vs bloat" test: I'm using a much higher shared_buffers (I think something like 200MB), since if you're working with tables this big I thought that would be a more appropriate value. Maybe that's the thing that makes the difference??? Can someone else test the patch?

And: I don't have that deep a knowledge of how postgresql deletes rows, but I thought that something like:

DELETE FROM mybloat WHERE RANDOM() < 0.9;

would only delete data, not indexes; so the patch should perform even better in this case (as it does, in fact, on my test machine), as:

- the original cluster method would read the whole index, and fetch only the "still alive" rows
- the new method would read the table using a seq scan, and sort in memory the few rows still alive

But, as I said, maybe I'm getting this part wrong...
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> I ran a few more performance tests on this patch. Here's what I got
> for the tests Leonardo posted originally:
> * 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
> * 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
> * 10M rows: 256 seconds for seq. scan, 611 seconds for index scan

I don't have time right now to run more tests; I'll try to make some by next week. Would it mean that doing:

create table p as select * from atable order by akey

(where akey is randomly distributed) with 5M rows is faster with enable_seqscan=0 and enable_indexscan=1??? That would be weird, especially on a laptop hard drive! (assuming there's a reasonable amount of memory set in work_mem/maintenance_work_mem)

> I tried a few more tests of creating a table with either 10M or 50M
> rows, then deleting 90% of the rows and running a cluster. The patch
> didn't fare so well here:
> * 10M rows: 84 seconds for seq. scan, 44 seconds for index scan
> [...]
> So I think there are definitely cases where this patch helps, but it
> looks like a seq. scan is being chosen in some cases where it doesn't
> help.
>
> Test machine: MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.

Again: what would the planner choose in that case for a:

create table p as select * from mybloat order by myid

??? I guess that if the planner makes the wrong choice in this case (that is, seq scan + sort instead of index scan), there's no way for "cluster" to behave in a different way. If, on the contrary, the "create table..." uses the right plan, and cluster doesn't, then we have a problem in the patch. Am I right?
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> > Here's my post with a (very simple) performance test:
> > http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
>
> I think the 10M rows test is more in line with what we want (83s vs. 646).

Can someone else test the patch to see if what I found is still valid? I don't think it makes much sense if I'm the only one that says "this is faster" :)
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> > 10% is nothing. I was expecting this patch would give an order of
> > magnitude of improvement or something like that in the worst cases of
> > the current code (highly unsorted input)
>
> Yes. It should be x10 faster than the ordinary method in the worst cases.

Here's my post with a (very simple) performance test:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php

The test I used wasn't a "worst case" scenario, since it is based on random data, not wrong-ordered data. Obviously, the real difference can be seen on large tables (5M+ rows) and/or slow disks.

Leonardo
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> I think writetup_rawheap() and readtup_rawheap() are a little complex,
> but should work as long as there is no padding between t_len and t_self
> in the HeapTupleData struct.
>
> - It might be cleaner if you write the total item length
> and tuple data separately.
> - "(char *) tuple + sizeof(tuplen)" might be more robust
> than "&tuple->t_self".

- I used your functions
- changed the docs for CLUSTER (I don't know if they make sense/are enough)
- added a minor comment

Two questions:

1) about the "copy&paste from FormIndexDatum" comment: how can I improve it? The idea is that we could have a faster call, but it would mean copying and pasting a lot of code from FormIndexDatum.
2) what other areas can I comment more?

sorted_cluster-20100721.patch
Description: Binary data
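P.S. If I understand the first suggestion, the write side would then look roughly like this (an untested sketch against the tuplesort internals, not the actual patch code):

static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
    HeapTuple    tuple = (HeapTuple) stup->tuple;
    unsigned int tuplen = tuple->t_len + HEAPTUPLESIZE;

    /* total item length first... */
    LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
    /* ...then the rest of the in-memory tuple, starting right after
     * t_len, as "(char *) tuple + sizeof(tuplen)" suggests */
    LogicalTapeWrite(state->tapeset, tapenum,
                     (char *) tuple + sizeof(tuplen),
                     tuplen - sizeof(tuplen));

    /* (a trailing length word for backward scans / randomAccess is
     * omitted in this sketch) */
}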