Drew, PgPool has to fetch all of the rows in order to sort them; only then can it limit how many are returned. Unfortunately, there is no magic distributed index that can help with this operation. (Imagine fetching a row through an index on one backend database and comparing it against every other backend database to determine whether that row belongs in the result set. Yikes!) You will find similar issues with GROUP BY and other clauses that require a sort. Subqueries can be problematic, too.
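To illustrate the optimization Drew asks about below (which pgpool 3.0 does not actually perform): if each worker applied the ORDER BY / LIMIT N locally, the coordinator would only need to merge the pre-sorted, pre-limited per-node lists and re-apply the limit, instead of buffering every row. A minimal sketch in Python, with made-up row data:

```python
import heapq

def top_n(per_node_rows, n):
    # Each worker has already applied ORDER BY date_created LIMIT n,
    # so every input list is sorted and holds at most n rows.
    # heapq.merge lazily merges the sorted streams; nsmallest keeps
    # only the global top n, so the coordinator touches at most
    # (number_of_nodes * n) rows rather than the whole table.
    return heapq.nsmallest(n, heapq.merge(*per_node_rows))

# Rows are (date_created, user_id) tuples; data here is invented.
node1 = [(1, 'a'), (3, 'c')]
node2 = [(2, 'b'), (4, 'd')]
print(top_n([node1, node2], 2))  # -> [(1, 'a'), (2, 'b')]
```

Without the per-node LIMIT, the coordinator must hold every row before sorting, which is exactly the dblink memory-exhaustion risk described below.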
Note that the data is returned to the PG cluster running pgpool for your sort and limit operations. This operation uses dblink, which allocates memory to hold the rows and can exhaust memory if the amount of data is too large. If you want to run a partitioned data warehouse, I suggest looking at Greenplum. If you have targeted data to retrieve, pgpool can do that. PgPool works well in parallel mode if you remember these limitations. I'm running several ginormous databases with it and get excellent response. Good luck!

Bob Lunney

________________________________
From: Drew Robb <[email protected]>
To: [email protected]
Sent: Thursday, June 2, 2011 11:17 AM
Subject: [Pgpool-general] Parallel query order by / limit

Hello,

I'm fairly new to pgpool (enjoying it so far) and I'm trying to understand parallel query behavior. I'm using pgpool 3.0.4 and PostgreSQL 9.0. I have a parallel table USERS distributed via the primary key user_id. When I execute the following query

    select user_id, date_created from users order by date_created limit 10;

I get the following output from pgpool debug:

    statement2: SELECT pool_parallel("SELECT user_id, users.date_created FROM users")

My understanding is that the parallel engine is unable to translate the ORDER BY and LIMIT clauses into the query executed on the worker nodes? This creates a big performance problem, since all records in the table are seemingly being processed. I read the documentation, but was unable to figure out why this is the case. It seems to me that the ORDER BY and LIMIT clauses could be applied on each worker (returning up to N*10 records) and then reapplied to filter these results down to just 10 records.

Is the behavior I'm seeing expected? Does the optimization I had in mind make sense? Either way, is there a way that I could specify exactly the parallel query to be executed on the workers to avoid this problem?
I know that creating a particular view and registering it in the dist_def table might work, but this isn't very dynamic.

Thanks,
Drew

_______________________________________________
Pgpool-general mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-general
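As a footnote to Bob's dblink remark above: Drew's per-worker rewrite can also be done by hand with dblink, pushing the ORDER BY / LIMIT to each backend and merging the pre-limited results on the coordinator. This is only a sketch — the connection strings, node names, and column types are assumptions, not taken from the thread:

```sql
-- Assumed backends 'node1' and 'node2'; each runs the limited query
-- locally, so at most 10 rows per node travel to the coordinator.
SELECT user_id, date_created
FROM (
  SELECT * FROM dblink('host=node1 dbname=app',
    'SELECT user_id, date_created FROM users ORDER BY date_created LIMIT 10')
    AS t1(user_id int, date_created timestamp)
  UNION ALL
  SELECT * FROM dblink('host=node2 dbname=app',
    'SELECT user_id, date_created FROM users ORDER BY date_created LIMIT 10')
    AS t2(user_id int, date_created timestamp)
) merged
ORDER BY date_created
LIMIT 10;   -- re-apply the limit over the merged per-node results
```

The outer ORDER BY / LIMIT is what makes the result correct: each node's top 10 may interleave arbitrarily with the others', so the coordinator must sort the merged set before cutting it down to 10.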
