Re: [PERFORM] Dealing with big tables

Sami Dalouche Sun, 02 Dec 2007 03:06:48 -0800

Hi,

my answer may be out of topic since you might be looking for a
postgres-only solution.. But just in case....


What are you trying to achieve exactly ? Is there any way you could
re-work your algorithms to avoid selects and use a sequential scan
(consider your postgres data as one big file) to retrieve each of the
rows, analyze / compute them (possibly in a distributed manner), and
join the results at the end ? 

A few pointers :
http://lucene.apache.org/hadoop/
http://www.gridgain.com/

Regards,
Sami Dalouche



On Sun, 2007-12-02 at 12:26 +0200, Mindaugas wrote:
> Hello,
> 
>   Started to work with big tables (like 300GB) and performance problems 
> started to appear. :(
> 
>   To simplify things - table has an index on From an index on To columns. And 
> it also have several other not indexed columns. There are 100000+ of 
> different values for From and the same for To.
> 
>   I execute simple query "select * from bigtable where From='something'". 
> Query returns like 1000 rows and takes 5++ seconds to complete. As far as I 
> understand the query is slow because:
>   - first it has to retrieve pointers to rows with data from index. That goes 
> fast.
>   - retrieve all the rows one by one. There we have 100% random read because 
> rows with the same From is distributed evenly through all the 300GB and most 
> probably nothing is cached. So there we are limited by _one_ disk performance 
> independently of how many disks we have in storage? And in case storage 
> consists of 15k rpm Cheetahs with 3.5ms average read seek time we should 
> expect not more than ~285 rows per second?
> 
>   I feel that I'm overlooking something here. But I'm new into data 
> warehousing. :)
> 
>   Also this query should greatly benefit from parallel execution or async IO. 
> Storage (seeks/second) scales almost linearly when it has a lot of disks. And 
> query is completely IO bound so it should scale well on single server.
> 
>   And I cannot use some index organized table or table partitioned by From :) 
> because there are at least 2 similar indexes by which queries can be executed 
> - From and To.
> 
>   Ideas for improvement? Greenplum or EnterpriseDB? Or I forgot something 
> from PostgreSQL features.
> 
>   Thanks,
> 
>   Mindaugas
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>                 http://www.postgresql.org/about/donate


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Re: [PERFORM] Dealing with big tables

Reply via email to