> Joseph Turner <[EMAIL PROTECTED]> writes:
> > I have a table with a decent number of rows (let's say for example a
> > billion rows). I am trying to construct a graph that displays the
> > distribution of that data. However, I don't want to read in the
> > complete data set (as reading a billion rows would take a while). Can
> > anyone think of a way to do this in postgresql?
One way would be to have an indexed column with random values in it. Then you could use an index scan to pull out samples. However, this has a few downsides:

a) Index scans are a lot less efficient than sequential scans, which effectively reduces the sample size you can get in a given amount of time even further. A 10% sample using this technique is probably almost as slow as reading the entire table, for example. If you only need a 0.1% sample, though, this might be a good approach (see the first sketch below).

b) The data in the random column would have to be static, meaning multiple samples wouldn't be entirely independent. Depending on what you're doing with the samples, this may or may not be a problem.

Oh, and if you're content with always using the same sample but want to analyze it in multiple different ways, you might want to use a partial index. You could have partial indexes matching the ORDER BY and WHERE clauses of your analyses, each with an extra WHERE clause of its own selecting on the random column; then make sure that WHERE clause appears in every query (see the second sketch below). But still, the sample has to be small enough that even using an index scan you're winning over reading the entire table.
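A minimal sketch of the random-column idea, assuming a table called "events" and a 0.1% sample (all names here are made up):

    -- Add a column of random values and index it. The UPDATE touches
    -- every row, so this setup is a one-time, expensive step.
    ALTER TABLE events ADD COLUMN sample_key double precision;
    UPDATE events SET sample_key = random();
    CREATE INDEX events_sample_key_idx ON events (sample_key);

    -- Pull roughly a 0.1% sample via an index scan:
    SELECT * FROM events WHERE sample_key < 0.001;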
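And a sketch of the partial-index variant, assuming the distribution you're graphing is over a hypothetical "amount" column:

    -- The partial index covers only the fixed 0.1% sample, so it
    -- stays small:
    CREATE INDEX events_amount_sample_idx ON events (amount)
        WHERE sample_key < 0.001;

    -- Repeat the index's WHERE clause in every analysis query so the
    -- planner can use the partial index:
    SELECT amount, count(*)
      FROM events
     WHERE sample_key < 0.001
     GROUP BY amount
     ORDER BY amount;

--
greg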