Marc,

For the kind of OLAP workload you describe in steps 2 and 3, you should
expect exactly one CPU working for you in Postgres: a single query is not
spread across cores.

If you want to speed up this processing by a factor of 100 or more on this
machine, you should try Greenplum DB, which is Postgres 8.2 compatible.
Based on the overall setup you describe, you could run a hybrid
installation with GPDB handling the reporting / OLAP workload and the other
Postgres databases handling the customer workloads.

- Luke


On 7/24/07 7:38 AM, "Marc Mamin" <[EMAIL PROTECTED]> wrote:

>  
> Hello,
> 
> thank you for all your comments and recommendations.
> 
> I'm aware that the conditions for this benchmark are not ideal, mostly
> due to the lack of time to prepare it. We will also need an additional
> benchmark on a less powerful - more realistic - server to better
> understand the scalability of our application.
> 
> 
> Our application is based on Java and generates dynamic reports from
> log file content. Dynamic means here that a report is calculated from
> the Postgres data the first time it is requested (it is then cached).
> Java is used to drive the data preparation and to handle/generate the
> report requests.
> 
> This is much more an OLAP system than an OLTP one, at least as far as
> our performance concerns go.
> 
> 
> 
> 
> Data preparation:
> 
> 1) Parsing the log files, with heavy use of Perl (regular expressions),
> to generate CSV files. Prepared statements also maintain reference
> tables in the DB. Postgres performance is not an issue for this first
> step.
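> 
> (Purely as an illustration - the table, column and host names below are
> made up - the kind of prepared statement used to maintain a reference
> table is of this form:)
> 
>     -- hypothetical reference table
>     CREATE TABLE ref_host (host_id serial PRIMARY KEY,
>                            name    text   UNIQUE NOT NULL);
> 
>     -- insert a key only if it is not known yet
>     PREPARE add_ref_host (text) AS
>         INSERT INTO ref_host (name)
>         SELECT $1
>         WHERE NOT EXISTS (SELECT 1 FROM ref_host WHERE name = $1);
> 
>     EXECUTE add_ref_host('web-01');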
> 
> 2) Loading the CSV files with COPY. As around 70% of the data to load
> ends up in a single daily table, we don't allow concurrent jobs for this
> step. We have between a few and a few hundred files to load into a
> single table; they are processed one after the other. A primary key is
> always defined; when the required indexes are already built and the new
> data are above a given size, we use a "shadow" table instead (without
> the indexes), build the indexes after the import, and then replace the
> live table with the shadow one.
> For example, we have a table of 13 GB plus 11 GB of indexes (5 of them).
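> 
> (Sketched in SQL, with placeholder table, column and file names, the
> shadow-table approach is essentially the following:)
> 
>     BEGIN;
>     CREATE TABLE daily_stats_shadow (LIKE daily_stats);  -- no indexes yet
>     COPY daily_stats_shadow FROM '/path/to/part1.csv' WITH CSV;
>     -- ... one COPY per file ...
>     SET maintenance_work_mem = '1GB';          -- helps the index builds
>     ALTER TABLE daily_stats_shadow ADD PRIMARY KEY (log_time, host_id);
>     CREATE INDEX daily_stats_shadow_host_idx
>         ON daily_stats_shadow (host_id);
>     -- swap the freshly built table in place of the live one
>     ALTER TABLE daily_stats RENAME TO daily_stats_old;
>     ALTER TABLE daily_stats_shadow RENAME TO daily_stats;
>     DROP TABLE daily_stats_old;
>     COMMIT;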
> 
> Performances:
> 
>     a) is there an "ideal" size to consider for our CSV files (100 files
> of 10 MB each, or rather a single 1 GB file)?
>     b) maintenance_work_mem: I'll use around 1 GB as recommended by
> Stefan
>     
> 3) Data aggregation. This is the heaviest part for Postgres. On our
> current system some queries take over an hour, with phases of around
> 100% CPU use alternating with phases of heavy I/O load when temporary
> results are written to / read from disk (pgsql_tmp). During the
> aggregation, other Postgres activity is low (or at least should be), as
> this takes place at night. Currently we have a locking mechanism to
> avoid having more than one such query running concurrently. This may be
> too strict for the benchmark server, but it better reflects our current
> hardware capabilities.
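> 
> (As an aside, if we implemented that lock inside Postgres itself rather
> than in the application, an advisory lock - available in 8.2 - would be
> one simple option; the lock id below is arbitrary:)
> 
>     -- 42 = application-wide id meaning "big aggregation running";
>     -- a second session calling this blocks until the lock is released
>     SELECT pg_advisory_lock(42);
>     -- ... run the aggregation queries ...
>     SELECT pg_advisory_unlock(42);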
> 
> Performances: Here we should favour a single huge transaction and can
> assume a low probability of another transaction requiring large sort
> space at the same time. Considering this, is it reasonable to set
> work_mem to 3 GB? (I guess I should raise this parameter dynamically,
> for the session only, before running the aggregation queries.)
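> 
> (What I have in mind is simply raising it in the session that runs the
> aggregation; the value below is only a placeholder to be tuned during
> the benchmark:)
> 
>     -- per-session only; postgresql.conf stays untouched
>     SET work_mem = '1GB';
>     -- ... aggregation queries ...
>     RESET work_mem;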
> 
> 4) Queries (report generation)
> 
> Only a few queries are not satisfactory, namely those requiring large
> sort operations. The data are structured in different aggregation levels
> (minutes, hours, days) with logical time-based partitions in order to
> limit the data size to compute for a given report. Moreover we can scale
> our infrastructure by using different or dedicated Postgres servers for
> different customers. Smaller customers may share the same instance, each
> of them having its own schema (the lock mechanism for large aggregations
> applies to a whole Postgres instance, not to a single customer). The
> benchmark will help us to plan such a distribution.
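> 
> (For illustration, the per-customer layout on a shared instance looks
> roughly like this; the schema, table and column names, and the monthly
> time slicing, are made up:)
> 
>     CREATE SCHEMA customer_a;
>     CREATE SCHEMA customer_b;
>     -- one table per aggregation level and time slice in each schema,
>     -- e.g. hourly data for July 2007:
>     CREATE TABLE customer_a.stats_hour_200707 (
>         log_time  timestamp NOT NULL,
>         host_id   integer   NOT NULL,
>         hits      bigint    NOT NULL
>     );
>     -- the application selects the customer by setting the search path
>     SET search_path TO customer_a, public;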
> 
> During the benchmark, we will probably not have more than 50 non-idle
> connections simultaneously. It is a bit too early for us to fine-tune
> this part; the benchmark will mainly focus on steps 1 to 3.
> 
> During the benchmark, the DB will reach a size of about 400 GB,
> simulating 3 different customers, with the data split roughly equally
> across 3 schemas.
> 
> 
> 
> I will post our configuration(s) later on.
> 
> 
> 
> Thanks again for all your valuable input.
> 
> Marc Mamin
> 



