Re: [GENERAL] Are there any options to parallelize queries?

2012-09-05 Thread Seref Arikan
Thanks Aleksey,
Definitely worth noting. Impressive scalability according to the slides. The
use of Java is particularly interesting to me.

Best regards
Seref


On Wed, Sep 5, 2012 at 6:27 AM, Aleksey Tsalolikhin atsaloli.t...@gmail.com
 wrote:

 Hi, Seref.  You might want to take a look at Stado:
 http://www.slideshare.net/jim_mlodgenski/scaling-postresql-with-stado

 Best,
 -at



Re: [GENERAL] Are there any options to parallelize queries?

2012-09-04 Thread Michael Paquier
On Wed, Aug 22, 2012 at 7:21 PM, Chris Travers chris.trav...@gmail.com wrote:

 Does Postgres-XC support query parallelism (at least splitting the
 query up for portions that run on different nodes)?  They just
 released 1.0.  I don't know if this sort of thing is supported there
 and it might be overkill at any rate.

Yes it does.
The Postgres-XC planner includes logic that allows portions of a query to be
shipped to remote nodes when necessary.
-- 
Michael Paquier
http://michael.otacoo.com


Re: [GENERAL] Are there any options to parallelize queries?

2012-09-04 Thread Aleksey Tsalolikhin
Hi, Seref.  You might want to take a look at Stado:
http://www.slideshare.net/jim_mlodgenski/scaling-postresql-with-stado

Best,
-at


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Are there any options to parallelize queries?

2012-08-22 Thread Seref Arikan
Craig and Pavel: thanks to you both for the responses.

Craig, this is for my PhD work, so no commercial interest at this point.
However, I'm pushing very hard at various communities for funding/support
for a Postgres based implementation of an EHR repository, that'll hopefully
benefit from my PhD efforts. I'll certainly add the option of funding some
key work into those discussions, which actually fits the model that we've
been discussing at the university for some time very well.

Kind regards
Seref


On Wed, Aug 22, 2012 at 4:24 AM, Craig Ringer ring...@ringerc.id.au wrote:

 On 08/21/2012 04:45 PM, Seref Arikan wrote:

  Parallel software frameworks such as Erlang's OTP or Scala's Akka do
 help a lot, but it would be a lot better if I could feed those
 frameworks with data faster. So, what options do I have to execute
 queries in parallel, assuming a transactional system running on
 postgresql?


 AFAIK, native support for parallelisation of query execution is currently
 almost non-existent in Pg. You generally have to break your queries up into
 smaller queries that do part of the work, run them in parallel, and
 integrate the results together client-side.

 There are some tools that can help with this. For example, I think
 PgPool-II has some parallelisation features, though I've never used them.
 Discussion I've seen on this list suggests that many people handle it in
 their code directly.

 Note that Pg is *very* good at concurrently running many queries, with
 features like synchronized scans. The whole DB is written around fast
 concurrent execution of queries, and it'll happily use every CPU and I/O
 resource you have. However, individual queries cannot use multiple CPUs or
 I/O threads; you need many queries running in parallel to use the
 hardware's resources fully.


 As far as I know the only native in-query parallelisation Pg offers is via
 effective_io_concurrency, and currently that only affects bitmap heap scans:

 
 http://archives.postgresql.org/pgsql-general/2009-10/msg00671.php

 ... not seqscans or other access methods.

 Execution of each query is done with a single process running a single
 thread, so there's no CPU parallelism except where the compiler can
 introduce some behind the scenes - which isn't much. I/O isn't parallelised
 either: not across invocations of nested loops, not by splitting seqscans
 up into chunks, etc.

 There are some upsides to this limitation, though:

 - The Pg code is easier to understand, maintain, and fix

 - It's easier to add features

 - It's easier to get right, so it's less buggy and more
   reliable.


 As the world goes more and more parallel, Pg is likely to follow at some
 point, but it's going to be a mammoth job. I don't see anyone volunteering
 the many months of their free time required, there's nobody being funded to
 work on it, and I don't see any of the commercial Pg forks that've added
 parallel features trying to merge their work back into mainline.

 If you have a commercial need, perhaps you can find time to fund work on
 something that'd help you out, like honouring effective_io_concurrency for
 sequential scans?

 --
 Craig Ringer



Re: [GENERAL] Are there any options to parallelize queries?

2012-08-22 Thread Chris Travers
Does Postgres-XC support query parallelism (at least splitting the
query up for portions that run on different nodes)?  They just
released 1.0.  I don't know if this sort of thing is supported there
and it might be overkill at any rate.

Best Wishes,
Chris Travers




[GENERAL] Are there any options to parallelize queries?

2012-08-21 Thread Seref Arikan
Dear all,
I am designing an electronic health record repository which uses postgresql
as its RDBMS technology. For those who may find the topic interesting, the
EHR standard I specialize in is openEHR: http://www.openehr.org/

My design makes use of parallel execution in the layers above the DB, and it
seems to scale quite well. However, I have a scale problem at hand. A
single patient can have up to 1 million different clinical data entries on
his/her own, after a few decades of usage. Clinicians do love their data,
and especially in chronic disease management, they demand access to
whatever data exists. If you have 20 years of data for a diabetic patient,
for example, they'll want to look for trends in that, or even scroll
through all of it, maybe with some filtering.
My requirement is to be able to process those 1 million records as fast as
possible. In the case of population queries, we're talking about billions of
records. Each clinical record (even with all the optimizations our domain
has developed in the last 30 or so years) leads to a number of rows, so
you can see that this is really big data (imagine a national diabetes
registry with lifetime data of a few million patients).
I am ready to consider Hadoop or other non-transactional approaches for
population queries, but clinical care still requires that I process
millions of records for a single patient.

Parallel software frameworks such as Erlang's OTP or Scala's Akka do help a
lot, but it would be a lot better if I could feed those frameworks with
data faster. So, what options do I have to execute queries in parallel,
assuming a transactional system running on postgresql? For example, I'd like
to get the last 10 years' records in chunks of 2 years of data, or chunks of 5K
records, fed to N parallel processing machines. The clinical
system should keep functioning in the meantime, with new records added etc.
PGPool looks like a good option, but I'd appreciate your input. Any proven
best practices, architectures, products?
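
To make the chunking idea concrete, here is a minimal sketch of splitting a
10-year window into 2-year ranges, one small query per range. The table name
clinical_entry, its columns patient_id and recorded_at, the patient id 42 and
the %s placeholder style are all assumptions made for illustration; nothing in
this thread specifies a schema or a driver.

```python
from datetime import date

def add_years(d, n):
    """Return d shifted by n years, mapping 29 February to 28 February
    when the target year is not a leap year."""
    try:
        return d.replace(year=d.year + n)
    except ValueError:
        return d.replace(year=d.year + n, day=28)

def year_chunks(start, end, years_per_chunk=2):
    """Split [start, end) into consecutive half-open date ranges."""
    chunks = []
    lower = start
    while lower < end:
        upper = min(add_years(lower, years_per_chunk), end)
        chunks.append((lower, upper))
        lower = upper
    return chunks

# Hypothetical query text: table and column names are illustrative only.
CHUNK_SQL = (
    "SELECT * FROM clinical_entry "
    "WHERE patient_id = %s AND recorded_at >= %s AND recorded_at < %s "
    "ORDER BY recorded_at"
)

chunks = year_chunks(date(2002, 8, 21), date(2012, 8, 21))
# One parameter tuple per chunk, ready to hand to N independent workers.
params = [(42, lower, upper) for lower, upper in chunks]
```

Half-open ranges (>= lower, < upper) keep a record that falls exactly on a
boundary from being fetched twice.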

Best regards
Seref


Re: [GENERAL] Are there any options to parallelize queries?

2012-08-21 Thread Pavel Stehule
Hello

2012/8/21 Seref Arikan serefari...@kurumsalteknoloji.com:
 Dear all,
 I am designing an electronic health record repository which uses postgresql
 as its RDBMS technology. For those who may find the topic interesting, the
 EHR standard I specialize in is openEHR: http://www.openehr.org/


http://stormdb.com/community/stado?destination=node%2F8

Regards

Pavel Stehule


 My design makes use of parallel execution in the layers above the DB, and it
 seems to scale quite well. However, I have a scale problem at hand. A single
 patient can have up to 1 million different clinical data entries on his/her
 own, after a few decades of usage. Clinicians do love their data, and
 especially in chronic disease management, they demand access to whatever
 data exists. If you have 20 years of data for a diabetic patient, for
 example, they'll want to look for trends in that, or even scroll through all
 of it, maybe with some filtering.
 My requirement is to be able to process those 1 million records as fast as
 possible. In the case of population queries, we're talking about billions of
 records. Each clinical record (even with all the optimizations our domain
 has developed in the last 30 or so years) leads to a number of rows, so you
 can see that this is really big data (imagine a national diabetes registry
 with lifetime data of a few million patients).
 I am ready to consider Hadoop or other non-transactional approaches for
 population queries, but clinical care still requires that I process millions
 of records for a single patient.

 Parallel software frameworks such as Erlang's OTP or Scala's Akka do help a
 lot, but it would be a lot better if I could feed those frameworks with data
 faster. So, what options do I have to execute queries in parallel, assuming
 a transactional system running on postgresql? For example, I'd like to get
 the last 10 years' records in chunks of 2 years of data, or chunks of 5K
 records, fed to N parallel processing machines. The clinical
 system should keep functioning in the meantime, with new records added etc.
 PGPool looks like a good option, but I'd appreciate your input. Any proven
 best practices, architectures, products?

 Best regards
 Seref





Re: [GENERAL] Are there any options to parallelize queries?

2012-08-21 Thread Craig Ringer

On 08/21/2012 04:45 PM, Seref Arikan wrote:


Parallel software frameworks such as Erlang's OTP or Scala's Akka do
help a lot, but it would be a lot better if I could feed those
frameworks with data faster. So, what options do I have to execute
queries in parallel, assuming a transactional system running on
postgresql?


AFAIK, native support for parallelisation of query execution is currently
almost non-existent in Pg. You generally have to break your queries up 
into smaller queries that do part of the work, run them in parallel, and 
integrate the results together client-side.
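
As a rough illustration of that pattern (split the work, run the pieces in
parallel, merge client-side), here is a minimal sketch. It is not tied to any
PostgreSQL driver: the in-memory ROWS list and the fetch_chunk function are
stand-ins invented for this example, and a real fetch_chunk would open its own
connection and run one range-restricted query.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a table: (record_id, year) rows for a single patient.
ROWS = [(i, 2003 + i % 10) for i in range(1000)]

def fetch_chunk(bounds):
    """Stand-in for one small query covering the half-open range
    [low, high) of years."""
    low, high = bounds
    return [row for row in ROWS if low <= row[1] < high]

def fetch_parallel(chunks, workers=4):
    """Run one fetch per chunk concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of `chunks`, regardless of
        # which worker finishes first.
        parts = list(pool.map(fetch_chunk, chunks))
    return [row for part in parts for row in part]

chunks = [(2003, 2005), (2005, 2007), (2007, 2009), (2009, 2011), (2011, 2013)]
result = fetch_parallel(chunks)
```

One caveat when doing this against a live database: each connection normally
sees its own snapshot, so chunks fetched while new records are being added may
not reflect a single consistent point in time.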


There are some tools that can help with this. For example, I think 
PgPool-II has some parallelisation features, though I've never used 
them. Discussion I've seen on this list suggests that many people handle 
it in their code directly.


Note that Pg is *very* good at concurrently running many queries, with
features like synchronized scans. The whole DB is written around fast
concurrent execution of queries, and it'll happily use every CPU and I/O
resource you have. However, individual queries cannot use multiple CPUs
or I/O threads; you need many queries running in parallel to use the
hardware's resources fully.



As far as I know the only native in-query parallelisation Pg offers is 
via effective_io_concurrency, and currently that only affects bitmap 
heap scans:


http://archives.postgresql.org/pgsql-general/2009-10/msg00671.php

... not seqscans or other access methods.

Execution of each query is done with a single process running a single
thread, so there's no CPU parallelism except where the compiler can
introduce some behind the scenes - which isn't much. I/O isn't
parallelised either: not across invocations of nested loops, not by
splitting seqscans up into chunks, etc.


There are some upsides to this limitation, though:

- The Pg code is easier to understand, maintain, and fix

- It's easier to add features

- It's easier to get right, so it's less buggy and more
  reliable.


As the world goes more and more parallel, Pg is likely to follow at some
point, but it's going to be a mammoth job. I don't see anyone 
volunteering the many months of their free time required, there's nobody 
being funded to work on it, and I don't see any of the commercial Pg 
forks that've added parallel features trying to merge their work back 
into mainline.


If you have a commercial need, perhaps you can find time to fund work on 
something that'd help you out, like honouring effective_io_concurrency 
for sequential scans?


--
Craig Ringer

