> Joseph Turner <[EMAIL PROTECTED]> writes:
> > I have a table with a decent number of rows (let's say for example a
> > billion rows). I am trying to construct a graph that displays the
> > distribution of that data. However, I don't want to read in the
> > complete data set (as reading a billion rows would take a while). Can
> > anyone think of a way to do this in postgresql?
One way would be to have an indexed column with random values in it. Then you could use an index scan to pull out samples. However, this has a few downsides:

a) Index scans are a lot less efficient than sequential scans, which effectively reduces the sample size you can get in a given amount of time even further. A 10% sample using this technique is probably almost as slow as reading the entire table, for example. If you only need a 0.1% sample, though, this might be a good approach (see the first sketch below).

b) The data in the random column would have to be static, meaning multiple samples wouldn't be entirely independent. Depending on what you're doing with the samples, this may or may not be a problem.

Oh, and if you're content with always using the same sample but want to analyze it in multiple different ways, you might want to use a partial index. You could have partial indexes matching the ORDER BY and WHERE clauses of your analyses, each with an extra WHERE clause of its own selecting on the random column; then make sure that WHERE clause appears in every query (see the second sketch below). But still, the sample has to be small enough that even using an index scan you're winning over reading the entire table.
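A minimal sketch of the random-column idea, assuming a table called "events" and a 0.1% sample (all names here are made up):

    -- Add a column of random values and index it. The UPDATE touches
    -- every row, so this setup is a one-time, expensive step.
    ALTER TABLE events ADD COLUMN sample_key double precision;
    UPDATE events SET sample_key = random();
    CREATE INDEX events_sample_key_idx ON events (sample_key);

    -- Pull roughly a 0.1% sample via an index scan:
    SELECT * FROM events WHERE sample_key < 0.001;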
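And a sketch of the partial-index variant, assuming the distribution you're graphing is over a hypothetical "amount" column:

    -- The partial index covers only the fixed 0.1% sample, so it
    -- stays small:
    CREATE INDEX events_amount_sample_idx ON events (amount)
        WHERE sample_key < 0.001;

    -- Repeat the index's WHERE clause in every analysis query so the
    -- planner can use the partial index:
    SELECT amount, count(*)
      FROM events
     WHERE sample_key < 0.001
     GROUP BY amount
     ORDER BY amount;

--
greg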