Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
>> Can you CLUSTER against a minmax index?
>
> Not in this release, at least in my understanding. It's not yet
> possible to do an ordered fetch, so the cluster scan probably won't
> work.

As per the patch I helped write, CLUSTER should use the sequential heap scan + sort whenever "it makes sense". So I think that if the index is not able to do an ordered fetch, CLUSTER should fall back to scan + sort automatically (which is what you want on a large table anyway). Obviously, that should be tested.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> From our discussions here, IMHO there is a strong case for avoiding
> btrees completely for larger historical data tables. That isn't
> something I had even considered as desirable before this conversation
> but ISTM now that taking that approach will be more fruitful than
> attempting to implement LSM trees.

Eh? I don't understand this point. How can I avoid btrees, yet still search by caller_id? I don't get it...

Simon Riggs wrote
> Alvaro has given me some results for his patch. The figures I have are
> for a 2GB table.
>
>                    MinMax                         Btree
> Index build time   11 s                           96 s
> Index size         2 pages + metapage             approx 200,000 pages + metapage
> Load time          no overhead, same as raw COPY  considerably slower

Great!!! This looks very promising. Were the indexed values sequential?
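A rough way to reproduce this kind of comparison today. Minmax indexes were only a patch at the time; BRIN (PostgreSQL 9.5+) is their descendant, so it is used here as a stand-in, and the table and column names are invented:

    -- Hypothetical test table; adjust the row count to taste.
    CREATE TABLE cdr_test AS
      SELECT g AS id,
             now() + (g || ' seconds')::interval AS call_timestamp,
             trunc(random() * 1000000)::bigint   AS caller_id
      FROM generate_series(1, 10000000) g;

    \timing on
    -- BRIN stands in for the minmax patch being measured above.
    CREATE INDEX cdr_test_brin  ON cdr_test USING brin  (call_timestamp);
    CREATE INDEX cdr_test_btree ON cdr_test USING btree (call_timestamp);

    -- Compare on-disk sizes:
    SELECT pg_size_pretty(pg_relation_size('cdr_test_brin'))  AS brin_size,
           pg_size_pretty(pg_relation_size('cdr_test_btree')) AS btree_size;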
Re: [HACKERS] Fast insertion indexes: why no developments
Jeremy Harris wrote
> Surely there's good correlation between IMSI & IMEI, so have a separate
> table to translate one to (a group of) the others, and
> halve the indexes on your main table?

Yes; unfortunately both are not always available, but it's something we are thinking about (it requires logic in the "inserting application" that doesn't exist at the moment, but it is something we'll have to add sooner or later). In the end, yes, trying to use fewer indexed fields is a good path.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> So in the use case you describe, the min max index would require a
> scan of only 25% of the table, not the 80% described earlier for
> random inserts. In my experience, people wish to keep data for much
> longer periods and so the percentage of scan required would drop lower
> than 25%, possibly to 5% or less for many applications.
>
> The plan would use sequential I/O so could still work quickly; given
> the low read rate, longer query times could be acceptable.

"Quickly"??? Seq-scanning 25% of a multi-TB table is not "quick".

Simon Riggs wrote
> Minmax indexes are simply a way to make this use case happen
> automatically, without the need for manual partitioning of the table.

Your logic assumes that we index nothing but call_timestamp. That would lead to huge query response times. Plus: btree has no trouble keeping up with sequential insertion (such as call_timestamp). So I still don't see the gain in using minmax indexes; again, could you point me to performance tests of *any* use case?

Simon Riggs wrote
> They are not the answer to every prayer, but with respect they are
> better than you had claimed they would be. (25% not 80%, in your use
> case). I saw this was likely to be the case and this is why I
> challenged you to describe in more detail. Thank you.

I claimed they would scan 80% of the table because I assumed I had to use them on the random fields, not on call_timestamp. I don't need a better index on call_timestamp: it's sequential, so I have no problem with it. But 25% is still useless: I don't want to seq-scan 25% of a multi-TB table.

Simon Riggs wrote
> Performance tests are only possible with a clear use case.

Well, so I could add my weird_index patch to the postgresql source code and it would be committed right away??? I assumed you had to prove somehow that a new index is better than what is already available, at least for some cases. Or, in other words: what are you going to write in the minmax index documentation, "try it and see if it works better for you"?

Simon Riggs wrote
> Please see that Alvaro and I have gone out of our way to provide a new
> facility to help you and others, and that it requires changing how we
> think about the solution. I accept it may not provide for every case
> but it requires careful analysis before deciding that is so.

If I came across too rough, I ask your pardon. I always appreciate people taking their time to help someone else for free. Plus, I'm *very* interested in minmax indexes, especially for call_timestamp (some queries are date-range only, such as "give me all the calls in this particular 2-second range") or for the id column I have. But, at the same time, I don't see any evidence that they work better than btrees (except for the size of the index). I would like to see some numbers.

I worked a little on the bitmap index implementation, and I stopped because, on the large tables these indexes are supposed to be used on, the heap lookups took so much time that the (slightly) faster index access didn't really help: it was a fraction of the query time... I'm afraid it would be the same with minmax indexes; that's why I wanted to see some numbers.

Simon Riggs wrote
>> And, again, I think that random values insertion is the worst
>> use case for minmax indexes.
>
> I agree, it is.
> What we have disagreed on is the extent to which that
> scenario exists for use cases on very large tables, which are
> typically "append-mostly" with most queries accessing a subset of the
> table, e.g. date range.

Mhh... maybe this is the point where we don't understand each other. I query the table by userid + date range. The date range is *not* selective enough (it covers, as you said, 25% of the multi-TB table). The userid is selective *a lot*. I'm looking for a "better" index for the userid column(s). The "new" indexes I mentioned in the OP claim to be better in this scenario (but I don't blindly believe them).

I don't see how indexing call_timestamp alone could be of any interest, since it would require seq-scanning 25% of a huge table for every query.
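To make the disagreement concrete, the query shape is roughly this (table and column names are hypothetical):

    -- A btree on the user column narrows this to a handful of rows;
    -- a minmax index on call_timestamp alone could only narrow it
    -- to roughly 25% of the table.
    SELECT *
    FROM   cdr
    WHERE  calling_imsi = 222015551234567                -- very selective
    AND    call_timestamp >= now() - interval '4 days'   -- not selective
    ORDER BY call_timestamp DESC;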
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> On 5 November 2013 14:28, Leonardo Francalanci <m_lists@> wrote:
>> Either my sql is not correct (likely), or my understanding of the minmax
>> index is not correct (even more likely), or the minmax index is not
>> usable in a random-inputs scenario.
>
> Please show the real world SQL you intend to run, so we can comment on
> it. Inventing a use case that breaks effectiveness of any optimisation
> is always easy, but the question is whether the use case is likely or
> even desirable.

The use case is pretty simple. Think of it as the NSA, as that makes it easier to picture.

Insert part: collect all calls made/received by any user on a mobile network (in fact, it's something more than calls, so it is really 2-10 rows per call). Keep the data for 20 days.

Query part: search calls made/received by a user, looked up by IMSI (caller id) or IMEI (phone id). The date range is usually days (the past 4 days; from 10 days ago to 5 days ago...). The result is just a very small percentage of the rows present in the table: a single user doesn't call that much! Searches are made by a human, so not that many requests per second. It's not merely a "write-mostly" scenario; it's a 99%-write, 1%-read scenario.

Problem? Having 4 btree indexes on random values (imsi + imei * 2, since we have calling and called) kills insertion performance after a while.

Solution so far? Partition every 15 minutes, and create the indexes in bulk afterwards.

Simon Riggs wrote
> If we have a query to show the most recent calls by a particular caller
>
> SELECT *
> FROM cdr
> WHERE callerid = X
> ORDER BY call_timestamp DESC
> LIMIT 100
>
> then this could potentially be optimised using a minmax index, by
> traversing the data ranges in call_timestamp order. That is not part
> of the code in this initial release, since the main use case is for
> WHERE call_timestamp >= X, or WHERE primarykey = Y

I don't understand how an index on call_timestamp would help with the query above.

Simon Riggs wrote
> I don't believe there is a credible business case for running that
> same query but without the ORDER BY and LIMIT, since it could
> potentially return gazillions of rows

Gazillions of rows??? We're talking about the calls made/received by one user here. How many calls do you make in 10 days???

Simon Riggs wrote
> so it isn't surprising at all
> that it would access a large % of the table.

In fact, the query I use returns a fraction of the table, and only a very small percentage of users ever gets searched.

Simon, you keep talking about these minmax indexes, and I still don't see any reference to performance tests. And, again, I think that random-values insertion is the worst use case for minmax indexes.
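A sketch of the table shape being described; names and types are guesses, not the real schema:

    -- Four btree indexes on effectively random values are what kills
    -- insertion performance after the first few million rows.
    CREATE TABLE cdr (
        id             bigserial,
        call_timestamp timestamptz NOT NULL,
        calling_imsi   bigint,
        called_imsi    bigint,
        calling_imei   bigint,
        called_imei    bigint
        -- ... plus the rest of the call record
    );
    CREATE INDEX ON cdr (call_timestamp);  -- sequential: cheap to maintain
    CREATE INDEX ON cdr (calling_imsi);    -- random: expensive
    CREATE INDEX ON cdr (called_imsi);     -- random: expensive
    CREATE INDEX ON cdr (calling_imei);    -- random: expensive
    CREATE INDEX ON cdr (called_imei);     -- random: expensive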
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> Some experiments I did a few years ago showed that applying sorts to the
> data to be inserted could be helpful even when the sort batch size was as
> small as one tuple per 5 pages of existing index. Maybe even less.

Cool!!! Do you have any idea/hint on how I could try to replicate that? Do you remember how you did it?
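A minimal way to try the idea, assuming rows first land in an unindexed staging table (hypothetical names). Inserting a batch in index-key order means each btree leaf page is touched once per batch instead of many times:

    BEGIN;
    INSERT INTO cdr
      SELECT * FROM cdr_staging
      ORDER BY calling_imsi;     -- presort on one index's key
    DELETE FROM cdr_staging;
    COMMIT;

    -- Note: with several random-key indexes you can only presort for
    -- one of them per pass; the others still see random insertions.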
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> Well, of course, they're not magic pixie dust.

Of course they aren't. I think they can make a difference in a sequential-input scenario. But I'm not the one who said they are fit to solve the problems that I and other people are describing in this thread.

Claudio Freire wrote
> But is your data really random? (or normal?)
> That's the thing...

I still don't understand. Even if the data were normal, it wouldn't work. You can change the 3rd parameter of normal_rand and get a narrower distribution, but then the query values would be narrower too... so you'd get the same results. The problem is: if the range between min and max in each page range is too wide, you end up scanning a lot of heap pages.

To me, "the thing" is: has anyone run performance tests of these minmax indexes with input that is not sequential? (BTW, I'd like to see tests for the sequential-input scenario too...)
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> real data isn't truly random

Well, let's try normal_rand then:

    create table t1 as
      select trunc(normal_rand(1000000, 50, 3)) as n,
             generate_series(1, 1000000) as i;

    with cte as (select min(n) as minn, max(n) as maxn, i/100
                 from t1 group by i/100),
         inp as (select generate_series(1, 100) as iinp,
                        trunc(normal_rand(100, 50, 3)) as s)
    select count(*), iinp
    from inp left outer join cte on inp.s between minn and maxn
    group by iinp;

Not that much different in my run...

Claudio Freire wrote
> you haven't really
> analyzed update cost, which is what we were talking about in that last
> post.

I don't care about a better update cost if the cost to query is a table scan. Otherwise, I could just claim that no index at all is even better than minmax: zero update cost, pretty much the same query time. Maybe there's value in minmax indexes for sequential data, but not for random data, which is the topic of this thread.

BTW, I would like to see some performance tests of minmax indexes vs btree for sequential input too... is the gain worth it? I couldn't find any mention of performance tests in the minmax threads.
Re: [HACKERS] Fast insertion indexes: why no developments
Claudio Freire wrote
> Min-max indexes always require a sequential scan of the min-max index
> itself when querying.

I'm worried about the number of heap pages that will be scanned. My understanding is that, given random input, the index will not be selective enough and will end up requiring a lot of page scanning to get the relevant rows. That is: the "selectivity" of the min/max values will be very low, given the high cardinality and randomness of the input. I'm afraid that in most page ranges the min will be lower than the queried value and the max higher, resulting in lots of heap page scans.

Quick test: a table with 1M rows, with random values from 1 to 1000:

    create table t as
      select generate_series(1, 1000000) as i,
             trunc(random() * 1000) as n;

Suppose a page range contains 100 rows (my understanding is that the minmax index will actually use a much bigger row count...); let's find how many "page ranges" should be scanned to find a series of 50 values (again, random in the 1-1000 range):

    with cte as (select min(n) as minn, max(n) as maxn, i/100
                 from t group by i/100),
         inp as (select generate_series(1, 50) as iinp,
                        trunc(random() * 1000) as s)
    select count(*)
    from inp left outer join cte on inp.s between minn and maxn
    group by iinp;

I get > 9000 page ranges for 49 values out of 50... which means scanning 90% of the table. So either my sql is not correct (likely), or my understanding of the minmax index is not correct (even more likely), or the minmax index is not usable in a random-inputs scenario.
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> On 5 November 2013 09:57, Leonardo Francalanci <m_lists@> wrote:
>> While I do believe in testing (since "In theory there is no difference
>> between theory and practice. In practice there is"), I would like to know
>> the "properties" of the minmax index before trying it.
>> What is it supposed to be good at? What are the pros/cons? We can't ask
>> all the users to just "try" the index and see if it works for them.
>
> No, but then all users aren't suggesting we need a new index type are
> they?
>
> I think its reasonable for you to spend time checking whether what you
> require will be met, and if not, what precise query doesn't it help,
> so we can better design any future new-index.

I don't understand the parallel between "We can't ask all the users to just try the index and see if it works for them" and "all users aren't suggesting we need a new index type". Anyway: I'm not suggesting we need a new index type. Please read my first post: I'm asking for info, fearing that there's just a lot of marketing/hype around those indexes. What do you mean by "spend time checking whether what you require will be met"? Met by what, minmax indexes?

Simon Riggs wrote
>> As I said, my understanding is that it is very efficient (both in
>> insertion and in searching) when data is somehow ordered in the table.
>> But maybe I got it wrong...
>>
>> Anyway, the sql scenario is simple: a table with 4 bigint indexes; data
>> in the fields is a random bigint in the range 1-1000. Insertion is 5-10K
>> rows/second. One search every 1-5 seconds, by one of the indexed fields
>> (only equality, no range). There's also an index on a timestamp field,
>> but that's not random so it doesn't give that many problems (it's
>> actually the one where I wanted to try the minmax).
>
> Without meaning to pick on you, imprecise analysis yields unhelpful
> new features. The clearer we are about what we are trying to solve the
> more likely we are to solve it well. 30 seconds analysis on what is
> needed is not sufficient to justify an additional man year of
> development, especially if a man year of work has already been done
> and the testing of the latest feature is now at hand.

I never said that my analysis justifies a man-year of work. As I said, I'm actually not confident at all that those "cool" indexes would work in my scenario even if we had them (marketing aside, there is not much data/tests out there on those indexes). I just wanted to know whether the matter had been discussed in the past, and to get more info. At the same time, I'm reluctant to try a new index hoping it will work in my case just because it's new and a man-year of work has already been done.

Again: what is this minmax index supposed to be good at? If it's indexing data in mostly-sequential order, it won't help me. From what I understood (maybe I got it wrong?), the index stores min/max values for sequences of pages. In my case I guess those min/max values would be close to 1 (min) / 1000 (max), because I insert data in random order. So any query would scan the entire table anyway. Am I wrong?

Simon Riggs wrote
> The requests from the indexes field, are they ordered?

Mmmh, I don't think I understand the question... an operator searches for the calls made by a user, so he searches in random order...

Simon Riggs wrote
> are they limited? Or you really want ALL calls? What is the tolerance
> on those?

I want all the calls made by the user. But there won't be many calls for one user. And most users will never be queried (as I said, one "calls by person" search every 1-5 seconds, so only a very small percentage of calls will ever be queried/retrieved).
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> Minmax indexes seem to surprise many people, so broad generalisations
> aren't likely to be useful.
>
> I think the best thing to do is to publish some SQL requests that
> demonstrate in detail what you are trying to achieve and test them
> against minmax indexes. That way we can discuss what does work and
> what doesn't work well enough yet.

While I do believe in testing (since "In theory there is no difference between theory and practice. In practice there is"), I would like to know the "properties" of the minmax index before trying it. What is it supposed to be good at? What are the pros/cons? We can't ask all users to just "try" the index and see if it works for them.

As I said, my understanding is that it is very efficient (both in insertion and in searching) when data is somehow ordered in the table. But maybe I got it wrong...

Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in the fields is a random bigint in the range 1-1000. Insertion is 5-10K rows/second. One search every 1-5 seconds, by one of the indexed fields (only equality, no range). There's also an index on a timestamp field, but that one isn't random so it doesn't give that many problems (it's actually the one where I wanted to try the minmax).
Re: [HACKERS] Fast insertion indexes: why no developments
Simon Riggs wrote
> Everybody on this thread is advised to look closely at Min Max indexes
> before starting any further work.
>
> MinMax will give us access to many new kinds of plan, plus they are
> about as close to perfectly efficient, by which I mean almost zero
> overhead, with regard to inserts as it is possible to get.

Simon, I don't understand how minmax indexes would help in a random-inserts scenario. I would love to use minmax for other columns (since we also partition and search based on a timestamp, which is usually well clustered), but my understanding was that minmax indexes would only be a good fit for a mostly-incremental-values scenario.
Re: [HACKERS] Fast insertion indexes: why no developments
Andres Freund-3 wrote
> On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@> wrote:
>>> Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could help performance on some workloads, though I suspect the buffer
>> would have to be pretty big to make it worthwhile on a big COPY that
>> generates mostly-random insertions.
>
> Even for random data presorting the to-be-inserted data appropriately
> could result in much better io patterns.

Mmh, I'm afraid the buffer would have to be huge to get any real advantage: you have to buffer enough values to avoid "touching" entire pages, which is not that easy. The LSM-tree is much more complicated than a simple memory buffer + delayed inserts. The other index types that are supposed to help with random-insertion workloads rely on sequential insertions (at the expense of more writing and slower reads) rather than reusing the btree pattern.
Re: [HACKERS] Fast insertion indexes: why no developments
Gavin Flower-2 wrote
> How about being able to mark indexes:
> 'MEMORY ONLY' to make them not go to disk
> and
> 'PERSISTENT | TRANSIENT' to mark if they should be recreated on
> machine bootup?

I would love that. But:

1) I'd like to run some tests with a "memory drive" first, to confirm that this would actually help (I'm sure I tried it in the past, but I don't remember the outcome)
2) I don't know whether the fact that they are in memory should be handled by the db or not. I was thinking about something more like "RECREATE IF NOT FOUND", that is: if the files aren't found at postgresql startup, re-create the index...
3) I don't know how many people would be interested (and how doable/complicated it would be, considering log replay, replication, etc. etc.)
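A minimal sketch of the test in point 1, assuming a tmpfs mount at a hypothetical /mnt/ramdisk (index and table names reused from the earlier sketches):

    -- tmpfs mounted beforehand, e.g.:
    --   mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
    CREATE TABLESPACE ram_ts LOCATION '/mnt/ramdisk/pg_ts';

    CREATE INDEX cdr_calling_imsi_idx ON cdr (calling_imsi)
        TABLESPACE ram_ts;

    -- After an OS crash or reboot the files are gone, so today the
    -- index must be recreated by hand: exactly the "RECREATE IF NOT
    -- FOUND" gap described in point 2.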
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--
> [...]
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get written to disk once.

Honestly, I ran a lot of tests in the past, and I don't remember whether I tried 15-minute checkpoints + high shared_buffers. That might work. I'm going to try it and see what happens.

Jeff Janes wrote
> If the buffers get read, dirtied, and evicted from a small shared_buffers
> over and over again
> then you are almost guaranteed that will get written to disk multiple
> times

As I understand it (but I might be wrong), high shared_buffers don't help, because with such random index writes lots and lots of pages get dirtied, even if the change in each page is minimal. So, within the 15-minute period, you write the same pages over and over again: even with high shared_buffers, the same page will get synced to disk multiple times (at every checkpoint).

The idea of those "other" indexes is to avoid the random writing by maximizing sequential writes, even if that means writing more bytes. In other words: writing a full 8KB is no different from writing 20 bytes in a page, as we'll have to sync the whole page anyway...

I'll try a 15-minute checkpoint interval... and see what happens.
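The settings under discussion, sketched as a postgresql.conf fragment; the values are illustrative only and would need tuning for the actual machine:

    shared_buffers = 16GB                # enough to hold the hot index pages
    checkpoint_timeout = 15min           # matched to the partitioning period
    checkpoint_segments = 256            # 9.3-era knob; raise it so WAL volume
                                         # doesn't force early checkpoints
    checkpoint_completion_target = 0.9   # spread the checkpoint writes out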
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions and inherit the new aggregated one. This would keep the value
> of K down, at the expense of re-writing data multiple times (but all
> methods write data multiple times, some just hide it from you).

Forgot to add, I also thought that we could:

- use RAM as the tablespace for the indexes, and move the indexes later (but postgresql doesn't handle a machine crash very well in this case... it would be cool to be able to create an index as "recreate on crash"...)
- use unlogged tables and then turn them into logged ones to speed up insertion somehow; I actually started to write a patch for it, but making it work with replication was too hard (not that I'm using replication, but a patch for "wal_level = minimal" only wouldn't be accepted). And it wouldn't solve the problem anyway.
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> Are partitions read-only once time has moved on, or can stragglers show up
> that need to be inserted into older partitions?
>
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions and inherit the new aggregated one. This would keep the value
> of K down, at the expense of re-writing data multiple times (but all
> methods write data multiple times, some just hide it from you).

Yes, we could "merge" the partitions: the idea was to merge them during the night hours, when traffic is low (and the NSA people are sleeping ;) ).

Jeff Janes wrote
> By the way, what is the transaction structure of your inserts? Are they
> large batches between commits, or is each row committed?

Large batches, of course (using COPY).
Re: [HACKERS] Fast insertion indexes: why no developments
> Point being: hardware is marching along pretty fast (after 20+ years
> of stagnation) and it's dangerous (IMO) to make big software
> investments based on the situation on the ground *today*.

Yes, that's a good point.
Re: [HACKERS] Fast insertion indexes: why no developments
Jeff Janes wrote
> The index insertions should be fast until the size of the active part of
> the indexes being inserted into exceeds shared_buffers by some amount
> (what that amount is would depend on how much dirty data the kernel is
> willing to allow in the page cache before it starts suffering anxiety
> about it). If you have enough shared_buffers to make that last for 15
> minutes, then you shouldn't have a problem inserting with live indexes.

Sooner or later you'll have to checkpoint those shared_buffers... and we are talking about GBs of data (my understanding is that we change basically every btree page, resulting in a re-write of the whole index).
Re: [HACKERS] Fast insertion indexes: why no developments
> LSM-trees seem patent free

I'm no expert, and I only gave it a look some time ago: it seemed very complicated to get right... and as far as I remember you don't get that much gain unless you go multi-level, which would complicate things further.

> Please somebody advise patent status of Y-trees otherwise I wouldn't
> bother.

Y-trees look much easier to get right... (and to me they also make more sense, but I'm not skilled enough to judge). There's also the FD-tree, which looks a lot like the (patented...) fractal tree: http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf
Re: [HACKERS] Fast insertion indexes: why no developments
> What is the reason for needing such fast access to individual groups
> of records? Sure sounds like the NSA or similar ;-)

Users need to search all calls originated from/to a user or from/to a specific mobile phone to answer/analyze customers' probl... ok, I give up: I work for the NSA ;)

> In terms of generality, do you think its worth a man year of developer
> effort to replicate what you have already achieved? Who would pay?

1) I haven't achieved what I need: realtime indexing. I can't query the "current 15 minutes" table efficiently. Plus, K*log(N) is not that great when K is large.
2) I'm not suggesting that this is top priority. I'm asking whether there's something else, other than "we don't have time for this", that I don't know. In fact, I don't even know whether those index types would really help in my (specific) case. That's why my original question was "why are there no developments in this area": I didn't mean to imply that someone should do it. I just wanted to know whether those indexes had already been discussed (and maybe dismissed for some reason) in the past...
Re: [HACKERS] Fast insertion indexes: why no developments
> Presumably the data you are inserting isn't actually random. Please
> describe the use case you are considering in more detail and some view
> on how frequent that is, with some examples. Once we understand the
> use case and agree it is important, we might solve problems.

Collecting call data for mobile network operators (and no, I don't work for the NSA...).

Easily 5,000-10,000 inserts per second. Indexes on timestamp and ID (not a problem: always increasing, so no btree issues) and on called #, calling #, IMSI, IMEI. The last four are obviously random, out of millions of possible values. After the first few million records, the disks can't keep up with the amount of random writing in the indexes.

Workaround: the table is partitioned every 15 minutes, and the indexes are created in bulk after we "start" the new 15-minute partition; see the sketch below. Searches on the current 15 minutes are not allowed (as it is not indexed), and searches on older data cost K*log(N) (where K is the number of partitions).

Yes, I could throw more disks at it, use ssd, shard more, etc. etc. But I still think that btrees just aren't fit for this kind of problem. I don't delete data, I don't update data, and there's not that much concurrency going on. I would sacrifice search speed (K*log(N) is already much slower than "regular" btree usage) for realtime insertion.

I don't think I'm the only one with a big system to be indexed by random values. In fact, I didn't want to turn this thread into a "help me with this workload" thread. I just wanted to know whether there were other known issues with these "different" indexes besides "not enough time to implement them correctly": I was afraid someone had already dismissed them as "good in theory, bad in practice"...
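A sketch of that workaround using the inheritance-based partitioning of the time; partition, table, and column names are invented:

    -- Every 15 minutes, open a new empty, index-less partition:
    CREATE TABLE cdr_20131105_0930 (
        CHECK (call_timestamp >= '2013-11-05 09:30'
           AND call_timestamp <  '2013-11-05 09:45')
    ) INHERITS (cdr);

    -- ... COPY streams into cdr_20131105_0930 for 15 minutes ...

    -- When the window closes, index the now-static partition in bulk:
    CREATE INDEX ON cdr_20131105_0930 (calling_imsi);
    CREATE INDEX ON cdr_20131105_0930 (called_imsi);
    CREATE INDEX ON cdr_20131105_0930 (calling_imei);
    CREATE INDEX ON cdr_20131105_0930 (called_imei);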
Re: [HACKERS] Fast insertion indexes: why no developments
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Reading the implementation of those features, I don't think they can help with the cases handled by the index types I mentioned (insertion of random values into big tables).
Re: [HACKERS] Fast insertion indexes: why no developments
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Ah! Didn't know that!

> Your comment about the lack of development in indexes seems counter to
> the literature that I've seen. The main problem is people keep
> patenting things, making it fairly difficult for everyone.

Mmh, maybe I wasn't clear: I meant the lack of development (maybe I should have said "implementation"?), in postgresql and in the other "sql databases", of the fast-insertion indexes you can find in the literature.
Re: [HACKERS] Fast insertion indexes: why no developments
> I bet you've mis-diagnosed the problem. Btrees don't have a problem
> keeping up with 50m records; your problem is that after a certain
> point your page cache can't keep up with the pseudo-random i/o
> patterns and you start seeing faults to storage.
> [...] This has nothing to do with the btree algorithm except to the
> extent it affects i/o patterns.

Of course; that's why those "different" index types aim to use more sequential writes than random ones.
Re: [HACKERS] Fast insertion indexes: why no developments
> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup. In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why. Again, this isn't a terribly
> encouraging precedent for implementing some other index type that's
> supposed to (sometimes) be faster than btrees.

Yes, I understand. Which is also why I was curious to know whether the "claims" those papers (and the databases using them) make are real...

Thank you everybody for your replies.

Leonardo
Re: [HACKERS] Fast insertion indexes: why no developments
> Another point to add: I don't really see btree as a barrier to
> performance for most of the problems I face. The real barriers to
> database performance are storage, contention, and query planning.

Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?) of people use/need. But if you try to insert rows into a 50M-row table with a couple of indexes, btrees just can't keep up. Of course, you can't have it all: fast insertion into big tables, low contention, good query times...

> Postgres btree indexes are pretty fast and for stuff like bulk
> insertions there are some optimization techniques available (such as
> sharding or create index concurrently).

At the moment I'm relying on partitioning + creating the indexes in bulk on the "latest" table (the partitioning is based on time). But that means K*log(N) search times (where K is the number of partitions). That's why I took a look at these different indexing mechanisms.
Re: [HACKERS] Fast insertion indexes: why no developments
> Before getting too excited about some new academic index type, it's worth
> noting the sad state in which hash indexes have languished for years.
> Nobody's bothered to add WAL support, let alone do any other real work
> on them. The non-btree index types that have been getting love are the
> ones that offer the ability to index queries that btree can't. I think
> a new index type whose only benefit is the claim to be faster in a narrow
> use-case is likely to end up like hash, not getting used enough to be
> properly maintained.
>
> regards, tom lane

Aren't hash indexes in a poor state precisely because they are not faster than btrees under every condition?
[HACKERS] Fast insertion indexes: why no developments
Hi,

I don't see much interest in insert-efficient indexes. These are the ones I've found:

- LSM-tree (used by Cassandra and SQLite4?)
- Y-Tree (http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf)
- Fractal indexes (TokuDB, patented)

While I understand that b*trees are still the best compromise in insertion/search speed, disk size, concurrency, and more generally in OLTP workloads, they are useless when it comes to inserting random (not ordered) values into big data tables (>50M rows).

I would like to know if the lack of development in this area (not only in postgresql, but in databases in general) is due to:

1) complex implementation
2) poor search performance
3) poor concurrency performance
4) not interesting for most users
5) something else???

I thought this was going to change due to the fast-insertion needs of "social applications", but TokuDB seems to be the only "successful" player in the area (and I don't know how much of that is due to good marketing). Most other DB technologies claim faster insertion speed (MongoDB and the like...), but in the end they rely on the old b*tree + sharding instead of using different indexing mechanisms (with the exception of Cassandra).

Thank you in advance

Leonardo
Re: [HACKERS] Page Checksums
> Agreed. I do agree with Heikki that it really ought to be the OS
> problem, but then we thought that about dtrace and we're still waiting
> for that or similar to be usable on all platforms (+/- 4 years).

My point is that this looks like it is going to take 1-2 years in postgresql, so it looks like a huge job... but at the same time I understand we can't just "hope other filesystems will catch up"! I guess this feature will be tunable (off/on)?
Re: [HACKERS] Page Checksums
>> I think what I meant was: isn't this going to be useless in a couple
>> of years (if, say, btrfs becomes available)? Or does it actually give
>> something that the FS will never be able to give?
>
> Yes, it will help you find/address bugs in the filesystem. These
> things are not unheard of...

It sounds to me like a huge job to fix some issues that are "not unheard of"...

My point is: if we are trying to fix misbehaving drives/controllers (something that is more common than one might think), that's already done by ZFS on Solaris and FreeBSD, and will be done by btrfs on Linux. I understand not trusting drives/controllers; but not trusting a filesystem... What am I missing? (I'm far from being an expert... I just don't understand...)
Re: [HACKERS] Page Checksums
On 21/12/2011 16.19, Stephen Frost wrote:
> * Leonardo Francalanci (m_li...@yahoo.it) wrote:
>> I can't help in this discussion, but I have a question: how different
>> would this feature be from filesystem-level CRC, such as the one
>> available in ZFS and btrfs?
>
> Depends on how much you trust the filesystem. :)

Ehm, I hope that was a joke...

I think what I meant was: isn't this going to be useless in a couple of years (if, say, btrfs becomes available)? Or does it actually give something that the FS will never be able to give?
Re: [HACKERS] Page Checksums
I can't help in this discussion, but I have a question: how different would this feature be from filesystem-level CRC, such as the one available in ZFS and btrfs?
Re: [HACKERS] WIP: Collecting statistics on CSV file data
> New API AnalyzeForeignTable

I didn't look at the patch, but I'm using CSV foreign tables over named pipes to get near-realtime KPIs calculated by postgresql. Since pipes can be read only once, I wouldn't want an "automatic analyze" of foreign tables...
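A minimal sketch of that setup with file_fdw; paths and column names are invented:

    -- shell, beforehand: mkfifo /var/run/kpi_pipe
    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_srv FOREIGN DATA WRAPPER file_fdw;

    CREATE FOREIGN TABLE kpi_feed (
        ts     timestamptz,
        metric text,
        value  numeric
    ) SERVER csv_srv
      OPTIONS (filename '/var/run/kpi_pipe', format 'csv');

    -- Each scan drains whatever the producer has written to the pipe,
    -- so an automatic ANALYZE would silently consume data meant for
    -- the reader:
    SELECT metric, avg(value) FROM kpi_feed GROUP BY metric;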
Re: [HACKERS] Is there any plan to add unsigned integer types?
> compression is an interesting topic: the guys over at tokudb are
> making some wild claims...i'm curious if they are real, and what the
> real tradeoffs are.

I don't know how much of the performance they claim comes from compression and how much from the different indexing technique they use (see my post here, where nobody answered... http://archives.postgresql.org/pgsql-general/2011-09/msg00615.php )
Re: [HACKERS] Your Postgresql 9.2 patch
> I've nearly finished editing prior to commit, so no worries.

Thank you, let me know if I can help.

Leonardo
Re: [HACKERS] Your Postgresql 9.2 patch
> Leonardo,
>
> Your patch:
>
> use less space in xl_xact_commit
>
> ... has been waiting on an updated version from you for 10 days now. Do
> you think you're likely to complete it for this CommitFest?

I sent an email on the subject:
http://postgresql.1045698.n5.nabble.com/use-less-space-in-xl-xact-commit-patch-tp4400729p4495125.html
and the thread went on and on, but now I'm confused about what to do... Can someone tell me what we want???

Leonardo
Re: [HACKERS] use less space in xl_xact_commit patch
> The important thing is that we retain backwards compatibility with
> current XLOG_XACT_COMMIT. I'm not worried what we call the other one.

Ok, let me see if I got it right:

    #define XLOG_XACT_COMMIT 0x00

should become:

    #define XLOG_XACT_COMMIT_WITH_INFO 0x00

and I'll add a

    #define XLOG_XACT_COMMIT 0x60

Then I'll leave 2 structs, "xl_xact_commit_with_info" and "xl_xact_commit".

Leonardo
Re: [HACKERS] use less space in xl_xact_commit patch
> With regards to the naming, I think it would be better if we kept the
> XLOG_XACT_COMMIT record exactly as it is now, and make the second
> record an entirely new record called XLOG_XACT_COMMIT_FASTPATH. That
> way we retain backwards compatibility.
>
> If you'd like to rework like that please, otherwise I can take it from
> here if you'd like.

I think I did it; while doing it, I believe I found a bug: I hadn't updated "recoveryStopsHere". Please double-check that, as I really don't know what I'm doing there... Should I also change the struct name from xl_xact_commit to xl_xact_commit_fast_path?

>> How can I test it with "weird" stuff such as subtransactions, shared
>> cache invalidation messages...?
>
> make installcheck should cover those.

Ok, all tests passed.

[Attachment: commitlog_lessbytes_v3.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
>> On Wed, May 25, 2011 at 3:05 PM, Simon Riggs wrote:
>
> Leonardo, can you submit an updated version of this patch today that
> incorporates Simon's suggestion?

Mmmh, maybe it was simpler than I thought; I must be missing something... patch attached.

How can I test it with "weird" stuff such as subtransactions, shared cache invalidation messages...?

Leonardo

[Attachment: commitlog_lessbytes_v2.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
> On Tue, Jun 14, 2011 at 2:31 PM, Simon Riggs wrote:
>
> Well, we certainly have the option to review and commit the patch any
> time up until feature freeze. However, I don't want the CommitFest
> application to be full of entries for patches that are not actually
> being worked on, because it makes it hard for reviewers to figure out
> which patches are in a state where they can be usefully looked at. AIUI,
> this one currently is not, because it was reviewed three weeks ago and
> hasn't been updated.

Yes, it's true: I thought I could find the time to work on it, but I didn't. Let me know the deadline for it, and I'll see if I can make it (otherwise I'll do it in the next commit fest).

Leonardo
Re: [HACKERS] use less space in xl_xact_commit patch
> We don't need to be in a hurry here. As the reviewer I'm happy to give
> Leonardo some time, obviously no more than the end of the commit fest.
>
> If he doesn't respond at all, I'll do it, but I'd like to give him the
> chance and the experience if possible.

Sorry I couldn't update the patch (in fact, it's more of a total rewrite than an update). How much time do I have?

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> Well, I sort of assumed the design was OK, too, but the more we talk
> about this WAL-logging stuff, the less convinced I am that I really
> understand the problem. :-(

I see. In fact, I think nobody thought about restart points... To sum up:

1) Everything seems ok in the wal_level = minimal case. In this case, we fsync everything, and at transaction commit we remove the init fork; in case of a crash, we don't replay anything (as nothing was written to the log), and we remove the main fork as we do now.

2) In the wal_level != minimal case, things become more complicated: if the standby reaches a restart point and then crashes, we are in trouble: it would remove the main fork at startup, and would replay only a portion of the table. I guess the same applies to the master too: if we log HEAP_XLOG_NEWPAGEs, reach a checkpoint, and then crash, at server restart the main fork would be deleted, and the pages logged in the WAL couldn't be replayed. But the problem on the master could be removed by using another record type instead of HEAP_XLOG_NEWPAGE (one to be replayed by the standbys only).

Is this analysis correct?

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> I think
> we need a detailed design document for how this is all going to work.
> We need to not only handle the master properly but also handle the
> slave properly. Consider, for example, the case where the slave
> begins to replay the transaction, reaches a restartpoint after
> replaying some of the new pages, and then crashes. If the subsequent
> restart from the restartpoint blows away the main relation fork, we're
> hosed. I fear we're plunging into implementation details without
> having a good overall design in mind first.

As I said in my first post, I'm basing the patch on this post:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
So I assumed the design was ok (except for the "stray file around on a standby" case, which was discussed earlier in this thread). If there are things to be discussed/analyzed (I guess the restart point issue is one of them), we can do it... but I thought the overall design was somehow in place.

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> Why is it necessary to replay the operation only on the slave? Can we
> just use XLOG_HEAP_NEWPAGE?

Uh, I don't know why, but I thought I shouldn't log the pages on the master, since all the pages are already there and fsync-ed. But if it does no harm, I can easily use XLOG_HEAP_NEWPAGE (of course, only in the wal_level != minimal case).
Re: [HACKERS] switch UNLOGGED to LOGGED
> From: Noah Misch
>> - the patch is missing the "send all table pages to the
>> standby" part; is there some code I can use as base?
>
> Nothing comes to mind as especially similar.
>
>> I guess I have to generate some special log type that
>> is only "played" by standby servers.
>
> What you described in your followup mail seemed reasonable.

So, it's ok to have a log item that is replayed only if WalRcvInProgress() is true? Is that a correct approach? I couldn't find any other way to find out whether we are on a standby or a master...

>> - on the standby, the commit part should be played as it
>> is on the master (that is, removing the INIT fork).
>> The abort case is different though: it would mean
>> doing nothing on the master, while removing every fork
>> but the INIT fork on the standby.
>> Would it be ok to add to xl_xact_abort a new array of
>> RelFileNode(s), where for each one at abort all the forks,
>> except the init fork, have to be deleted by the standby
>> (while the master shouldn't do anything with them)?
>> I bet there's a cleaner solution...
>
> Your "use less space in xl_xact_commit patch" seems to be going in a good
> direction here. It would probably also be okay to do a
> ResetUnloggedRelations() on the standby at every abort of a transaction
> that had started an UNLOGGED -> LOGGED conversion. That is, just a flag
> might be enough.

Ok, but that would mean that a transaction aborting a conversion would try to reset all unlogged relations (traversing the whole data directory)... I don't know if that's acceptable performance-wise.
Re: [HACKERS] use less space in xl_xact_commit patch
Sorry, email sent without body. Fixed some English mistakes.

[Attachment: commitlog_lessbytes02.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
[Attachment: commitlog_lessbytes02.patch]
Re: [HACKERS] use less space in xl_xact_commit patch
> Da: Simon Riggs
> I can't find a clear discussion of what you are trying to do, and how,
> just a URL back to a complex discussion on another topic.

While trying to write a patch to allow changing an unlogged table into a logged one, I had to add another int field to xl_xact_commit. Robert Haas said:

"I have to admit I don't like this approach very much. I can't see adding 4 bytes to every commit record for this feature."

which is a correct remark. xl_xact_commit can contain some arrays (relations to drop, committed sub-transactions, shared invalidation msgs). The lengths of these arrays are specified by 3 ints in the struct. So, to avoid adding more ints to the struct, I was advised to remove all the ints and use xl_xact_commit.xinfo to flag which arrays are, in fact, present. The whole idea is:

- remove nrels, nsubxacts and nmsgs from xl_xact_commit
- use bits in xinfo to signal which arrays are present at the end of xl_xact_commit
- for each present array, add the length of the array (as an int) at the end of xl_xact_commit
- add each present array after all the lengths

> I can't see any analysis that shows whether this would be effective in
> reducing space of WAL records and what the impacts might be.

The size of the WAL records should always be <= what it was without the patch: in the worst-case scenario, all the ints are present (as they would be without the patch); in the best case, which I guess is the most common, each xl_xact_commit is 3 ints shorter. I don't think the effect will be "perceivable": the whole idea is to allow more variable-length structures in xl_xact_commit in the future, without increasing the space on disk.

> The patch contains very close to zero comments.

I tried to add some.

Leonardo

[Attachment: commitlog_lessbytes01.patch]
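A sketch of the record layout described in the list above. The flag and field names are illustrative, not the actual patch code; the types come from the PostgreSQL headers utils/timestamp.h, storage/relfilenode.h, and storage/sinval.h:

    /*
     * Instead of always storing the three array-length counters,
     * xinfo bits say which arrays follow, and only the lengths of
     * the arrays actually present are written after the fixed part.
     */
    #define XACT_COMMIT_HAS_RELFILENODES  0x01
    #define XACT_COMMIT_HAS_SUBXACTS      0x02
    #define XACT_COMMIT_HAS_INVALMSGS     0x04

    typedef struct xl_xact_commit
    {
        TimestampTz xact_time;  /* time of commit */
        uint32      xinfo;      /* flag bits: which arrays follow */
        /*
         * Then, for each bit set in xinfo, in a fixed order, an int
         * array length; after all the lengths, the arrays themselves:
         *   RelFileNode   xnodes[];   -- if ..._HAS_RELFILENODES
         *   TransactionId subxacts[]; -- if ..._HAS_SUBXACTS
         *   SharedInvalidationMessage msgs[]; -- if ..._HAS_INVALMSGS
         */
    } xl_xact_commit;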
Re: [HACKERS] switch UNLOGGED to LOGGED
> - the patch is missing the "send all table pages to the
> standby" part; is there some code I can use as base?
> I guess I have to generate some special log type that
> is only "played" by standby servers.

Maybe I could use log_newpage, but instead of XLOG_HEAP_NEWPAGE I could emit something like XLOG_HEAP_COPYPAGE; then, in heap_redo, in the XLOG_HEAP_COPYPAGE case, call heap_xlog_newpage only if we're on a standby...
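Something like this, that is (XLOG_HEAP_COPYPAGE is the invented record type, and WalRcvInProgress() is just one possible way to test for a standby; sketch only):

        /* Hypothetical new case in heap_redo: a full-page copy that
         * only standby servers should apply. */
        case XLOG_HEAP_COPYPAGE:
            if (WalRcvInProgress())     /* replay only on a standby */
                heap_xlog_newpage(lsn, record);
            break;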
Re: [HACKERS] switch UNLOGGED to LOGGED
I'll try to sum up what I understood:

1) the standby keeps the lock, so there is no problem with stray files coming from the unlogged->logged log replay, as the table can't be read during the operation
2) calling ResetUnloggedRelations before ProcArrayApplyRecoveryInfo would remove the problem of stray files on the standby in case the master crashes before commit/abort
3) promoting the standby shouldn't be an issue, since ResetUnloggedRelations is already called in ShutdownRecoveryTransactionEnvironment

Now, to move forward, some questions:

- the patch is missing the "send all table pages to the standby" part; is there some code I can use as a base? I guess I have to generate some special log type that is only "played" by standby servers.
- on the standby, the commit part should be played as it is on the master (that is, removing the INIT fork). The abort case is different though: it would mean doing nothing on the master, while removing every fork but the INIT fork on the standby. Would it be OK to add to xl_xact_abort a new array of RelFileNode(s), where for each one, at abort, all the forks except the init fork have to be deleted by the standby (while the master shouldn't do anything with them)? I bet there's a cleaner solution... (a rough sketch of the idea is below)

Leonardo
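P.S. Just to make the xl_xact_abort question concrete, a rough sketch (the nstandbyrels field is invented, not a worked-out design):

    typedef struct xl_xact_abort
    {
        TimestampTz xact_time;     /* time of abort */
        int         nrels;         /* existing: files to drop at abort */
        int         nsubxacts;     /* existing: aborted subxids */
        int         nstandbyrels;  /* new: relations to strip down to the
                                    * INIT fork on the standby; the master
                                    * ignores these entries */
        /* the RelFileNode, TransactionId and RelFileNode arrays follow */
    } xl_xact_abort;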
Re: [HACKERS] switch UNLOGGED to LOGGED
> I'd guess some WAL record arising from the post-crash master restart makes the
> standby do so. When a crash isn't involved, the commit or abort record is that
> signal. You could test and find out how it happens after a master crash with a
> procedure like this:
>
> 1. Start a master and standby on the same machine.
> 2. Connect to master; CREATE TABLE t(); BEGIN; ALTER TABLE t ADD c int;
> 3. kill -9 -`head -n1 $master_PGDATA/postmaster.pid`
> 4. Connect to standby and confirm that t is still locked.
> 5. Attach debugger to standby startup process and set breakpoints on
> StandbyReleaseLocks and StandbyReleaseLocksMany.
> 6. Restart master.

Well yes, based on the test the call stack is something like:

StandbyReleaseLocksMany
StandbyReleaseOldLocks
ProcArrayApplyRecoveryInfo
xlog_redo

It's not very clear to me what ProcArrayApplyRecoveryInfo does (I'm not too familiar with the standby part, I guess), but I see it's called by xlog_redo in the "info == XLOG_CHECKPOINT_SHUTDOWN" case and by StartupXLOG. But I don't know if calling ResetUnloggedRelations before the call to ProcArrayApplyRecoveryInfo in xlog_redo makes sense... if it does, it would solve the problem of the stray files in the master-crash case, I guess?

> When you promote the standby, though, ShutdownRecoveryTransactionEnvironment()
> releases the locks.

If I understand the code, ResetUnloggedRelations is called before ShutdownRecoveryTransactionEnvironment, so that part shouldn't be an issue.

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:
> > > By the time the startup process
> > > releases the AccessExclusiveLock acquired by the proposed
> > > UNLOGGED -> normal conversion process, that relfilenode
> > > needs to be either fully copied or unlinked all over again.
> > > (Alternately, find some other way to make sure queries don't
> > > read the half-copied file.)
> >
> > About this issue: how are AccessExclusiveLocks released on
> > the standby when the master crashes?
>
> I assume those locks remain. It wouldn't be safe to release them; a master
> crash is just one kind of WAL receipt latency.

But somehow, when the master restarts, the standby gets notified that it has to unlock???

> When you promote the standby, though, ShutdownRecoveryTransactionEnvironment()
> releases the locks.

OK; then the problem in the UNLOGGED -> normal conversion is this:

1) the master and the standby acquire a lock on the table
2) the master starts sending the pages to the standby
3) the master crashes before committing

Up to here, no problems, as the standby still holds the lock on the table.

4) when the master restarts, it removes all the forks for rels that have INIT forks; are those "deletes" sent to the standby? And, if yes, would they be replayed by the standby *before* releasing the lock?

If the answer is "yes", then I don't think we have a problem... but I think that at the moment those unlogged-table-fork deletes aren't sent at all.

(When promoting the standby, we could have ShutdownRecoveryTransactionEnvironment() remove all the forks for rels with INIT forks before releasing the locks.)

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> By the time the startup process
> releases the AccessExclusiveLock acquired by the proposed
> UNLOGGED -> normal conversion process, that relfilenode
> needs to be either fully copied or unlinked all over again.
> (Alternately, find some other way to make sure queries don't
> read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on the standby when the master crashes?
Re: [HACKERS] use less space in xl_xact_commit patch
This is a second version, now using

    int counts[1];    /* variable-length array of counts */

in xl_xact_commit to keep track of the lengths of the different arrays at the end of the struct. Waiting for feedback...

Leonardo

commitlog_lessbytes00.patch
Description: Binary data
Re: [HACKERS] use less space in xl_xact_commit patch
> int counts[1]; /* variable-length array of counts, xinfo flags define
> length of array and meaning of counts */

Damn, that's much cleaner than what I did. I don't know why I stuck with the idea that it had to be:

    int, array, int, array, ...

instead of:

    int, int, ..., array, array, ...

which makes much more sense.

> Then, I'd make macros like this:
>
> #define XactCommitNumberOfDroppedRelFileNodes(xlrec) \
>     ((xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ? xlrec->counts[0] : 0)
> #define XactCommitNumberOfCommittedSubXids(xlrec) \
>     ((xlrec->xinfo & XACT_COMMITTED_SUBXIDS) ? \
>      xlrec->counts[(xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ? 1 : 0] : 0)
> ...etc...

Ehm, I don't know if macros would be enough; that's OK for the first two, but then I think it would become a mess... Maybe I'll use a simple function that gets all the "ints" at once (something like the sketch below).

> ...and a similar set of macros that will return a pointer to the
> beginning of the corresponding array, if it's present. I'd lay out
> the record like this:
>
> - main record
> - array of counts (might be zero-length)
> - array of dropped relfilenodes (if any)
> - array of committed subxids (if any)
> - array of sinval messages (if any)

OK

> Also, it's important not to confuse xact completion with xact commit,
> as I think some of your naming does. Completion could perhaps be
> thought to include abort.

Mmh... I don't know if I got that... but I'll take a look, and ask questions...

Thank you very much for the help

Leonardo
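P.S. This is what I mean by getting all the "ints" at once (a sketch; the flag names are the hypothetical ones from the quoted macros, plus one invented for the sinval messages):

    /*
     * Decode all the array lengths in one pass, in the fixed order in
     * which the arrays appear after the main record.
     */
    static void
    XactCommitGetCounts(xl_xact_commit *xlrec,
                        int *nrels, int *nsubxacts, int *nmsgs)
    {
        int pos = 0;

        *nrels = (xlrec->xinfo & XACT_COMMIT_DROPPED_RELFILENODES) ?
            xlrec->counts[pos++] : 0;
        *nsubxacts = (xlrec->xinfo & XACT_COMMITTED_SUBXIDS) ?
            xlrec->counts[pos++] : 0;
        *nmsgs = (xlrec->xinfo & XACT_COMMIT_INVAL_MSGS) ?
            xlrec->counts[pos++] : 0;
    }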
[HACKERS] use less space in xl_xact_commit patch
Hi, following the conversation at

http://postgresql.1045698.n5.nabble.com/switch-UNLOGGED-to-LOGGED-tp4290461p4382333.html

I tried to remove some bytes from xl_xact_commit. The way I did it needs a palloc+memcpy; I guess it could be done reusing the memory from smgrGetPendingDeletes, but I don't think that's that important. I guess there are other ways of doing it; let me know what you think.

Leonardo

commitlog_firstv.patch
Description: Binary data
Re: [HACKERS] switch UNLOGGED to LOGGED
> I don't think making xinfo shorter will save anything, because
> whatever follows it is going to be a 4-byte quantity and therefore
> 4-byte aligned.

Oops, I didn't notice that. I'll split xinfo into:

    uint16 xinfo;
    uint16 presentFlags;

I guess that helps with the reading? I mean, instead of having a single uint32?
Re: [HACKERS] switch UNLOGGED to LOGGED
> Yes, that seems like a very appealing approach. There is plenty of
> bit-space available in xinfo, and we could reserve a bit each for
> nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
> that item is present and clear meaning that the count is omitted from
> the structure (and zero). This will probably require a bit of tricky
> code reorganization so I think it should be done separately from the
> main patch.

OK, I'll try and send a patch with this change only. BTW, xinfo is 32 bits long, but I think only 2 bits are used right now? I think I can make it 8 bits, and add another 8 bits for nrels, nsubxacts, nmsgs and the new thing. That would save another 2 bytes, while leaving space for extension. Or we can make it 8 bits only, but then only 2 bits would be left "empty" for future extensions; I don't know if we care about that...

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci wrote:
> >> Maybe you should change xl_xact_commit to have a separate list of rels to
> >> drop the init fork for (instead of mixing those with the list of files to
> >> drop as a whole).
> >
> > I tried to follow your suggestion, thank you very much.
>
> I have to admit I don't like this approach very much. I can't see
> adding 4 bytes to every commit record for this feature.

I understand. What if, in xl_xact_commit, instead of "RelFileNode xnodes" I used another struct for xnodes, something like:

    {
        RelFileNode xnode;
        bool        onlyInitFork;
    }

That would increase the commit record size only when there are RelFileNode(s) to drop at commit. So, instead of 4 bytes in every commit, there are "wasted" bytes only when the commit record contains deleted permanent relations (which should happen much less often). I'm open to suggestions here...

> > 3) Should we have a "cascade" option? I don't know if I have to handle
> > inherited tables and other dependent objects
>
> Look at the way ALTER TABLE [ONLY] works for other action types, and copy it.

OK

Thank you very much

Leonardo
Re: [HACKERS] Unlogged tables, persistent kind
> > If that 1% is random (not time/transaction related), usually you'd rather
> > have an empty table.
>
> Why do you think it would be random? "Heap blocks would be zeroed if they
> were found to be damaged, following a crash."

If you erase full blocks, you have no idea what data you erased; it could be something changed 1 hour ago, 1 month ago, or 1 year ago. This is very different from, say, synchronous_commit=off: in that case, "the most recent transactions may be lost if the database should crash". In your case, "some (who knows which???) data is lost". So, to me that sounds like random loss. I don't think that is different from a corrupted table. You're not deleting rows recently changed; you're deleting everything that is "physically close" to them.

> > In other words: is a table that is not consistent with anything else in
> > the db useful?
>
> That's too big a leap. Why would it suddenly be inconsistent with the
> rest of the database?

If you delete some data, and you have no idea what data you lost, I don't think you have a consistent db. Unless, of course, your table has no relation with any other table in the db.

Of course, all these thoughts are based on the assumption that I know what happens when a block is erased; but my knowledge of postgresql internals is not that good, so I might be *very* wrong.
Re: [HACKERS] Unlogged tables, persistent kind
> The amount of data loss on a big
> table will be <1% of the data loss
> caused by truncating the whole table.

If that 1% is random (not time/transaction related), usually you'd rather have an empty table. In other words: is a table that is not consistent with anything else in the db useful?
Re: [HACKERS] Unlogged tables, persistent kind
The only data we can't rebuild is the heap. So what about an option for UNlogged indexes on a LOGged table? It would always preserve data, and it would 'only' cost rebuilding the indexes in case of an unclean shutdown. I think it would give a boost in performance for all those cases where the I/O (especially random I/O) is caused by the indexes, and it doesn't look too complicated (but maybe I'm missing something).

I proposed the unlogged-to-logged patch (BTW, has anyone given a look at it?) because we partition data based on a timestamp, and we can risk losing the last N minutes of data, but after N minutes we want to know the data will always be there, so we would like to set a partition table to 'logged'.

Leonardo
Re: [HACKERS] What Index Access Method Functions are really needed?
> > another question regarding indexes. Sadly I can't find enough info in
> > the documentation. Which of the functions are needed in order for an
> > index to work?
>
> All of them.

Maybe I completely misunderstood the question, but some functions are optional, such as amgetbitmap, right?

http://www.postgresql.org/docs/9.0/static/index-functions.html
Re: [HACKERS] switch UNLOGGED to LOGGED
> Maybe you should change xl_xact_commit to have a separate list of rels to
> drop the init fork for (instead of mixing those with the list of files to
> drop as a whole).

I tried to follow your suggestion, thank you very much. Here's a first attempt at the patch. I "tested" it with:

create table forei (v integer primary key);
insert into forei select * from generate_series(1,1);
create unlogged table pun (c integer primary key,
    constraint con foreign key (c) references forei(v));
insert into pun select * from generate_series(1,1);
alter table pun set logged;

then shut down the master with "immediate":

bin/pg_ctl -D data -m immediate stop
bin/pg_ctl -D data start

and "pun" still has data:

select * from pun where c=100;

Questions/comments:

1) it's a very-first-stage patch; I would need to know if something is *very* wrong before cleaning it up.
2) there are some things I implemented using a logic like "let's see how it worked 10 lines above, and I'll do the same". For example, the 2PC stuff is totally "copied" from the other places; I have no idea if the code makes sense at all (how can I test it?)
3) Should we have a "cascade" option? I don't know if I have to handle inherited tables and other dependent objects
4) During the check for dependency problems, I stop as soon as I find an error; would that be enough?

Leonardo

unl2log_20110422.patch
Description: Binary data
Re: [HACKERS] switch UNLOGGED to LOGGED
I think I coded a very basic version of the UNLOGGED to LOGGED patch (only the wal_level=minimal case for the moment).

To remove the INIT fork, I changed PendingRelDelete somehow to have a flag "bool onlyInitFork", so that the delete would remove only the INIT fork at commit. Everything "works" (note the quotes...) except in the case of a not-clean shutdown ("-m immediate" to pg_ctl stop). The reason is that the replay code doesn't have any idea that it has to remove only the INIT fork: it will remove ALL forks; so at the end of the redo procedure (at startup, after a "pg_ctl -m immediate stop") the table doesn't have any forks at all. See xact_redo_commit:

/* Make sure files supposed to be dropped are dropped */
for (i = 0; i < xlrec->nrels; i++)
{
    [...]
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        if (smgrexists(srel, fork))
        {
            XLogDropRelation(xlrec->xnodes[i], fork);
            smgrdounlink(srel, fork, true);
        }
    }
    smgrclose(srel);
}
[...]

Should I change xl_xact_commit in order to have, instead of:

/* Array of RelFileNode(s) to drop at commit */
RelFileNode xnodes[1];    /* VARIABLE LENGTH ARRAY */

an array of structures such as:

{
    RelFileNode relfilenode;
    bool        onlyInitFork;
}

??? Otherwise I don't know how to tell the redo commit code to delete only the INIT fork, instead of all the relation forks... (maybe I'm doing it all wrong: I'm open to any kind of suggestion here...) (A sketch of the modified redo loop is below.)

Leonardo
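P.S. With such a struct, the redo loop might look roughly like this (sketch only; it mirrors the excerpt above, with srel opened as in the original code):

/*
 * Sketch: assumes xl_xact_commit carries an array of
 * { RelFileNode relfilenode; bool onlyInitFork; } entries.
 */
for (i = 0; i < xlrec->nrels; i++)
{
    [...]   /* open srel as in the original code */
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        /* when the flag is set, drop ONLY the INIT fork */
        if (xlrec->xnodes[i].onlyInitFork && fork != INIT_FORKNUM)
            continue;

        if (smgrexists(srel, fork))
        {
            XLogDropRelation(xlrec->xnodes[i].relfilenode, fork);
            smgrdounlink(srel, fork, true);
        }
    }
    smgrclose(srel);
}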
Re: [HACKERS] switch UNLOGGED to LOGGED
> > If the master crashes while a transaction that used CREATE TABLE is unfinished,
> > both the master and the standby will indefinitely retain identical, stray (not
> > referenced by pg_class) files. The catalogs do reference the relfilenode of
> > each unlogged relation; currently, that relfilenode never exists on a standby
> > while that standby is accepting connections. By the time the startup process
> > releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
> > conversion process, that relfilenode needs to be either fully copied or unlinked
> > all over again. (Alternately, find some other way to make sure queries don't
> > read the half-copied file.) In effect, the problem is that the relfilenode is
> > *not* stray, so its final state does need to be well-defined.
>
> Oh, right.
>
> Maybe we should just put in a rule that a server in Hot Standby mode
> won't ever try to read from an unlogged table (right now we count on
> the fact that there will be nothing to read). If we crash before
> copying the whole file, it won't matter, because the catalogs won't
> have been updated, so we'll refuse to look at it anyway. And we have
> to reinitialize on entering normal running anyway, so we can clean it
> up then.

OK then... I'll try to code the "easy" version first (the wal_level=minimal case) and then we'll see.

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> > But re-reading it, I don't understand: what's the difference in creating
> > a new "regular" table and crashing before emitting the abort record,
> > and converting an unlogged table to logged and crashing before
> > emitting the abort record? How do the standby servers handle a
> > "CREATE TABLE" followed by a ROLLBACK if the master crashes
> > before writing the abort record? I thought that too would "leave a
> > stray file around on a standby".
>
> I've been thinking about the same thing. And AFAICS, your analysis is
> correct, though there may be some angle to it I'm not seeing. Anyone else?

I would like to know if what I'm trying to do is, in fact, possible... otherwise starting with the wal_level=minimal case first will be wasted effort in case the other cases can't be integrated somehow...

Leonardo
Re: [HACKERS] switch UNLOGGED to LOGGED
> I'm pretty sure we wouldn't accept a patch for a feature that would
> only work with wal_level=minimal, but it might be a useful starting
> point for someone else to keep hacking on.

I understand. Reading your post at

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

I thought I got the part:

"what happens if we *crash* without writing an abort record? It seems like that could leave a stray file around on a standby, because the current code only cleans things up on the standby at the start of recovery"

But re-reading it, I don't understand: what's the difference between creating a new "regular" table and crashing before emitting the abort record, and converting an unlogged table to logged and crashing before emitting the abort record? How do the standby servers handle a "CREATE TABLE" followed by a ROLLBACK if the master crashes before writing the abort record? I thought that too would "leave a stray file around on a standby".

Leonardo
[HACKERS] switch UNLOGGED to LOGGED
Hi, I read the discussion at

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

From what I can understand, going from/to unlogged to/from logged in the wal_level == minimal case is not too complicated. Suppose I try to write a patch that allows

ALTER TABLE tablename SET LOGGED (or UNLOGGED)

(proper SQL wording to be discussed...) only in the wal_level == minimal case: would it be accepted as a "first step"? Or rejected because it doesn't allow it in the other cases? From what I got in the discussion, handling the other wal_level cases can be very complicated (for example, the issues in case "we *crash* without writing an abort record").

Leonardo
Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?
> Here's my single biggest tip for newcomers to the Postgres source:
> if you don't use ctags, glimpse, or some other tool that can quickly
> show you all references to a given identifier, go out and get one.
> It's one of the easiest ways to learn about things.

I guess Eclipse is the best tool for the newcomer, as it gives you all the references very easily and shows the declaration of functions/structs when you hover over one with the mouse:

http://wiki.postgresql.org/wiki/Working_with_Eclipse
Re: [HACKERS] plan time of MASSIVE partitioning ...
> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
> per child table, only one of which was relevant to the query. In your
> test case there are no irrelevant indexes, which is why the runtime
> didn't change.

Mmh... I must be doing something wrong. It looks to me like it's not just the irrelevant indexes: it's the "order by" that counts. If I remove that, times are the same with and without the patch. Using the test case:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with the patch; but adding:

order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);
DO $$DECLARE i int;
BEGIN
FOR i IN 0..2000 LOOP
    EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
        ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
    EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
    EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
    EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
    EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add

order by b

the non-patched version runs in 10 seconds. Am I getting it wrong?
Re: [HACKERS] plan time of MASSIVE partitioning ...
> Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
> like this:

Well, the 9000-table test case was meant to check the difference in performance with/without the patch... I don't see the reason for trying to optimize such an unrealistic case. BTW, can someone explain to me which are the cases where the patch actually helps?
Re: [HACKERS] plan time of MASSIVE partitioning ...
> This is going to be dominated by constraint exclusion checking. There's
> basically no fix for that except a more explicit representation of the
> partitioning rules.

Damn, I knew that was going to be more complicated :) So in which case does this patch help? I guess in a multi-index scenario? childtables.sql is kind of hard to read (I think a FOR loop would have been more self-explanatory?).
Re: [HACKERS] plan time of MASSIVE partitioning ...
> but I don't get any gain from the patch... explain time is still around 250 ms.
> Tried with 9000 partitions, time is still 2 secs.

Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...) and got 250 ms with both versions; with 9000 partitions, 2 secs (again, no gain from the patch).
Re: [HACKERS] plan time of MASSIVE partitioning ...
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

Hi, I'm really interested in this patch. I tried a simple test case:

create table t (a integer, b text);
DO $$DECLARE i int;
BEGIN
FOR i IN 0..9000 LOOP
    EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 ||
        ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
    EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around 250 ms. Tried with 9000 partitions, time is still 2 secs. Maybe I've missed the patch's purpose completely? (I tried the test case at http://archives.postgresql.org/message-id/4cbd9ddc.4040...@cybertec.at and that, in fact, gets a boost with this patch.)

Leonardo
Re: [HACKERS] Custom aggregation function that creates an array
> Before you start coding, have you looked over what's in
> contrib/intarray ?

There's nothing that fulfills my needs there, but I guess it would be the perfect place to look for code examples! Thank you. (I think I made my custom aggregation function work, but I'll look into the intarray code to see if I made mistakes...)
[HACKERS] Custom aggregation function that creates an array
Hi, I want to write a custom aggregate function that, given an "int4 index", increments the element at "index" of an array and, at the end, returns the array. The array will always be int4[]. I need it in C, since plpgsql is way slower (and I need to use it on 5M+ rows).

I did it, but I also need to create the array "on demand", not as initcond (the array size will be a parameter). I suppose that, since I'll be using "construct_array" (if that makes sense), I'll need a new MemoryContext? I've looked at array_agg_transfn, but I think that's a lot more complicated because it repallocs, so it has to keep the memory context... Since I'm not familiar with these things, would something like this work?

if (PG_ARGISNULL(0))
{
    // first time: alloc array
    arr_context = AllocSetContextCreate(rcontext,
                                        "myCustomArrayResult",
                                        ALLOCSET_DEFAULT_MINSIZE,
                                        ALLOCSET_DEFAULT_INITSIZE,
                                        ALLOCSET_DEFAULT_MAXSIZE);
    oldcontext = MemoryContextSwitchTo(arr_context);
    transarray = construct_array([...]);
    MemoryContextSwitchTo(oldcontext);
}
else
{
    transarray = PG_GETARG_ARRAYTYPE_P(0);
}

This is how the C function is written right now:

ArrayType *transarray;
int32      index = PG_GETARG_INT32(1);
int32     *array;

if (AggCheckCallContext(fcinfo, NULL))
{
    // code above would go in place of this line:
    transarray = PG_GETARG_ARRAYTYPE_P(0);
}
else
    transarray = PG_GETARG_ARRAYTYPE_P_COPY(0);

array = (int32 *) ARR_DATA_PTR(transarray);
array[index]++;
PG_RETURN_ARRAYTYPE_P(transarray);

(checks on boundaries etc. omitted)

The problem is that it expects the array to be already available, so I had to do:

CREATE AGGREGATE cust_agg_func (int4) (
    sfunc = cust_agg_funct_add,
    stype = int4[],
    initcond = '{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
);

I would like to have:

CREATE AGGREGATE cust_agg_func (int4 index, int4 array_size_to_be_created) (
    sfunc = cust_agg_funct_with_array_creation_add,
    stype = int4[]
);
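In other words, I imagine something like this (an untested sketch: it uses the aggregate memory context that AggCheckCallContext can return in its second argument, instead of creating a new context):

Datum
cust_agg_funct_with_array_creation_add(PG_FUNCTION_ARGS)
{
    MemoryContext aggcontext;
    ArrayType  *transarray;
    int32       index = PG_GETARG_INT32(1);
    int32       size  = PG_GETARG_INT32(2);
    int32      *array;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "must be called in an aggregate context");

    if (PG_ARGISNULL(0))
    {
        /* First call: build a zeroed int4[size] array in the aggregate's
         * context, so it survives between transition calls. */
        MemoryContext oldcontext = MemoryContextSwitchTo(aggcontext);
        Datum  *zeros = (Datum *) palloc0(size * sizeof(Datum));

        transarray = construct_array(zeros, size, INT4OID,
                                     sizeof(int32), true, 'i');
        MemoryContextSwitchTo(oldcontext);
    }
    else
        transarray = PG_GETARG_ARRAYTYPE_P(0);

    /* increment the requested slot in place */
    array = (int32 *) ARR_DATA_PTR(transarray);
    if (index >= 0 && index < size)
        array[index]++;

    PG_RETURN_ARRAYTYPE_P(transarray);
}

The aggregate would then be declared without initcond, as in the second CREATE AGGREGATE above. But again, this is just my guess at how it could work.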
Re: [HACKERS] On the usefulness of hint bits
> I wonder if we could improve this with some sort of process-local
> cache - not to get rid of hint bits, just to reduce pg_clog
> contention. We might easily end up testing the same XID many times
> during the same table scan.

I guess that's my scenario... not that many transactions, so even "copying" the whole pg_clog table into per-process memory would be doable...
Re: [HACKERS] On the usefulness of hint bits
> Reduction of contention for pg_clog access, for one thing. If you read
> the archives, you'll find that pg_clog access contention has been shown
> to be one cause of "context swap storms". Having to go to clog for
> every single tuple access would make that orders of magnitude worse.

OK; is that the "Wierd context-switching issue on Xeon" thread? Or does that have nothing to do with it? I searched for "context swap storms pg_clog" but I didn't get anything...

> More generally, we're not going to give up hint bits even if there are
> identifiable workloads where they don't buy much --- because there are
> many others where they do.

Sure: I wasn't suggesting giving them up, just making their usage user-selectable (and "on" by default).
[HACKERS] On the usefulness of hint bits
Hi, I was wondering what the advantage of hint bits is for OLAP-style workloads, that is, when the number of transactions is not that high.

If I got it right, in 10 pg_clog pages we can store the status of more than 320,000 transactions (each status takes 2 bits, so an 8 KB page holds 8192 x 4 = 32,768 statuses). That's a lot, in a very small space (80 KB). So I was wondering what gain we get from hint bits in cases where pg_clog is "small" (that is, it will be cached by postgresql/the OS).

Does somebody have some numbers on the effect of hint bits on first/second reads? I mean:

create table mytable as ...
select * from mytable   -> this one will update the hint bits
select * from mytable   -> this one will use them

To test it, I guess we should change the code to have:

- a version where hint bits are never updated (that is, always look at pg_clog), so that you don't have to write them the first time, and you'll get the "true" reading time + pg_clog reading time
- a version that always sets them to COMMITTED, so that you don't have to write them the first time, and you'll get the "true" reading time for the "second" read that would use them, regardless of any pg/OS cache

I'm asking because I don't like having all those writes on the first scan... and I would like to know what the real benefit is for the reads that come after the first one in case there are "few" transactions per second (for example, less than 1 transaction per second).
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> Applied with some significant editorialization. The biggest problem I
> found was that the code for expression indexes didn't really work, and
> would leak memory like there's no tomorrow even when it did work.

Sorry I couldn't write it the way it was supposed to be... I'll look at the differences to see what I did wrong (so maybe next time I'll do better!).

Thank you

Leonardo
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> It sounds like the costing model might need a bit more work before we
> commit this.

I tried again the simple SQL tests I posted a while ago, and I still get the same ratios. I tested the applied patch on a dual-Opteron + disk-array Solaris machine. I really don't get how a laptop hard drive can be faster at reading data using random seeks (required by the original cluster method) than seq scan + sort for the 5M-row test case. Same thing for the "cluster vs bloat" test: the seq scan + sort is faster on my machine.

I've just noticed that Josh used shared_buffers = 16MB for the "cluster vs bloat" test: I'm using a much higher shared_buffers (I think something like 200MB), since if you're working with tables this big I thought that would be a more appropriate value. Maybe that's the thing that makes the difference??? Can someone else test the patch?

And: I don't have that deep a knowledge of how postgresql deletes rows, but I thought that something like:

DELETE FROM mybloat WHERE RANDOM() < 0.9;

would only delete data, not indexes; so the patch should perform even better in this case (as it does, in fact, on my test machine), as:

- the original cluster method would read the whole index, and fetch only the "still alive" rows
- the new method would read the table using a seq scan, and sort in memory the few rows still alive

But, as I said, maybe I'm getting this part wrong...
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> I ran a few more performance tests on this patch. Here's what I got
> for the tests Leonardo posted originally:
> * 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
> * 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
> * 10M rows: 256 seconds for seq. scan, 611 seconds for index scan

I don't have time right now to run more tests; I'll try to make some by next week. Would it mean that doing:

create table p as select * from atable order by akey

(where akey is randomly distributed) with 5M rows is faster with enable_seqscan=0 and enable_indexscan=1??? That would be weird, especially on a laptop hard drive! (assuming there's a reasonable amount of memory set in work_mem/maintenance_work_mem)

> I tried a few more tests of creating a table with either 10M or 50M
> rows, then deleting 90% of the rows and running a cluster. The patch
> didn't fare so well here:
> * 10M rows: 84 seconds for seq. scan, 44 seconds for index scan
> [...]
> So I think there are definitely cases where this patch helps, but it
> looks like a seq. scan is being chosen in some cases where it doesn't
> help.
>
> Test machine: MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.

Again: what would the planner choose in that case for a:

create table p as select * from mybloat order by myid

??? I guess that if the planner makes the wrong choice in this case (that is, seq scan + sort instead of index scan), there's no way for "cluster" to behave in a different way. If, on the contrary, the "create table..." uses the right plan, and cluster doesn't, then we have a problem in the patch. Am I right?
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> > Here's my post with a (very simple) performance test:
> > http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
>
> I think the 10M rows test is more in line with what we want (83s vs. 646).

Can someone else test the patch to see if what I found is still valid? I don't think it makes much sense if I'm the only one that says "this is faster" :)
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> > 10% is nothing. I was expecting this patch would give an order of
> > magnitude of improvement or something like that in the worst cases of
> > the current code (highly unsorted input)
>
> Yes. It should be x10 faster than the ordinary method in the worst cases.

Here's my post with a (very simple) performance test:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php

The test I used wasn't a "worst case" scenario, since it is based on random data, not wrong-ordered data. Obviously, the real difference can be seen on large tables (5M+ rows) and/or slow disks.

Leonardo
Re: I: [HACKERS] About "Our CLUSTER implementation is pessimal" patch
> I think writetup_rawheap() and readtup_rawheap() are a little complex,
> but should work as long as there is no padding between t_len and t_self
> in the HeapTupleData struct.
>
> - It might be cleaner if you write the total item length
> and tuple data separately.
> - "(char *) tuple + sizeof(tuplen)" might be more robust
> than "&tuple->t_self".

- I used your functions
- changed the docs for CLUSTER (I don't know if they make sense/are enough)
- added a minor comment

Two questions:

1) about the "copy&paste from FormIndexDatum" comment: how can I improve it? The idea is that we could have a faster call, but it would mean copying and pasting a lot of code from FormIndexDatum.
2) what other areas can I comment more?

sorted_cluster-20100721.patch
Description: Binary data
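P.S. If I understand the first suggestion, the write side would then look roughly like this (an untested sketch against the tuplesort internals, not the actual patch code):

static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
    HeapTuple    tuple = (HeapTuple) stup->tuple;
    unsigned int tuplen = tuple->t_len + HEAPTUPLESIZE;

    /* total item length first... */
    LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
    /* ...then the rest of the in-memory tuple, starting right after
     * t_len, as "(char *) tuple + sizeof(tuplen)" suggests */
    LogicalTapeWrite(state->tapeset, tapenum,
                     (char *) tuple + sizeof(tuplen),
                     tuplen - sizeof(tuplen));

    /* (a trailing length word for backward scans / randomAccess is
     * omitted in this sketch) */
}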