Drew,

PgPool has to fetch all of the rows before it can order them; only then can it 
limit how many are returned.  Unfortunately, there is no magic distributed 
index that can help with this operation.  (Imagine fetching a row through an 
index on one backend database and comparing it against every other backend to 
determine whether that row belongs in the result set.  Yikes!)  You will find 
similar issues with GROUP BY and other clauses that require a sort.  Subqueries 
can be problematic, too.
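The per-node top-N idea Drew raises below (have each backend return its own 
top 10, then merge) can be sketched outside of pgpool.  This is purely 
illustrative application-level code, not something pgpool 3.0 does for you; 
the function name and shape are my own invention:

```python
import heapq
import itertools

def global_top_n(per_node_rows, n, key=lambda row: row):
    """Merge each backend's local top-n into a global top-n.

    per_node_rows: one iterable of rows per backend node.
    Each node only needs to ship its own n smallest rows,
    so at most n * num_nodes rows cross the network instead
    of the whole table.
    """
    # Step 1: each node sorts locally and keeps only its top n.
    partial = [sorted(rows, key=key)[:n] for rows in per_node_rows]
    # Step 2: merge the already-sorted partial lists and keep the first n.
    merged = heapq.merge(*partial, key=key)
    return list(itertools.islice(merged, n))
```

For a query like ORDER BY date_created LIMIT 10, the key function would pull 
out date_created from each row; the merge step is cheap because each partial 
list is already sorted.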

Note that the data is returned to the PostgreSQL cluster running pgpool for 
your sort and limit operations.  This operation uses dblink, which allocates 
memory to hold the rows and can exhaust memory if the amount of data is too 
large.  If you want to run a partitioned data warehouse, I suggest looking at 
Greenplum.  If you have targeted data to retrieve, pgpool can do that.


PgPool works well in parallel mode if you keep these limitations in mind.  I'm 
running several ginormous databases with it and get excellent response times.


Good luck!

Bob Lunney
________________________________
From: Drew Robb <[email protected]>
To: [email protected]
Sent: Thursday, June 2, 2011 11:17 AM
Subject: [Pgpool-general] Parallel query order by / limit


Hello, I'm fairly new to pgpool (enjoying it so far) and I'm trying to 
understand parallel query behavior. I'm using pgpool 3.0.4 and PostgreSQL 
9.0.

I have a parallel table USERS distributed via the primary key user_id.  When I 
execute the following query:

select user_id, date_created from users order by date_created limit 10;
I get the following output from pgpool debug:

statement2: SELECT pool_parallel("SELECT user_id, users.date_created FROM 
users")

My understanding is that the parallel engine is unable to translate the order 
by and limit clauses into the query executed on the worker nodes? This 
creates a big performance problem, since all records in the table are 
seemingly being processed.  I read the documentation, but was unable to 
figure out why this is the case. It seems to me that the order by and 
limit clauses could be applied on the workers (returning up to N*10 
records) and then reapplied to filter these results down to just 10 
records. Is the behavior I'm seeing expected, and does the optimization I had 
in mind make sense?

Either way, is there a way that I could specify exactly the parallel query 
to be executed on the workers to avoid this problem? I know that 
creating a particular view and registering it in the dist_def table might 
work, but this isn't very dynamic.

Thanks,
Drew
_______________________________________________
Pgpool-general mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-general
