Re: [SQL] [PERFORM] EXTERNAL storage and substring on long strings

Richard Huxton Mon, 04 Aug 2003 08:58:34 -0700

On Monday 04 August 2003 16:25, Scott Cain wrote:
[snip]
> In an effort to find the best way to do this operation, I decided to
> look at what is my "worst case" scenario: the DNA sequence for human
> chromosome 1, which is about 250 million characters long (previous
> strings where about 20 million characters long).  I wrote a perl script
> to do several substring operations over this very long string, with
> substring lengths varying between 1000 and 40,000 characters spread out
> over various locations along the string.  While EXTENDED storage won in
> this case, it was a hollow victory: 38 seconds per operation versus 40
> seconds, both of which are way too long to for an interactive
> application.
>
> Time for a new method.  A suggestion from my boss was to "shred" the DNA
> into smallish chunks and a column giving offsets from the beginning of
> the string, so that it can be reassembled when needed. Here is the test
> table:
>
> string=> \d dna
>       Table "public.dna"
>  Column  |  Type   | Modifiers
> ---------+---------+-----------
>  foffset | integer |
>  pdna    | text    |
> Indexes: foffset_idx btree (foffset)


[snipped plpgsql function which stitches chunks together and then substrings]

> So here's the question: I've never written a plpgsql function before, so
> I don't have much experience with it; is there anything obviously wrong
> with this function, or are there things that could be done better?  At
> least this appears to work and is much faster, completing substring
> operations like above in about 0.27 secs (that's about two orders of
> magnitude improvement!)

You might want some checks to make sure that smin < smax, otherwise looks like 
it does the job in a good clean fashion.

Glad to hear it's going to solve your problems. Two things you might want to 
bear in mind:
1. There's probably a "sweet spot" where the chunk size interacts well with 
your data, usage patterns and PGs backend to give you peak performance. 
You'll have to test.
2. If you want to search for a sequence you'll need to deal with the case 
where it starts in one chunk and ends in another.

-- 
  Richard Huxton
  Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org

Re: [SQL] [PERFORM] EXTERNAL storage and substring on long strings

Reply via email to