Okay guys,

Thanks for all the great help and advice already! Let me clear up a few 
things to make my question a little easier to answer :-)
Now, my site is a search engine for used cars - not just a car shop with a few 
hundred cars.
The update query you are looking at is executed once a day, in chunks, for all 
active adverts, so we know they are still for sale (one car can be advertised 
in several places, hence several "adverts"). So it's not a constant stream, 
but the volume is fairly high, especially at night.
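
For reference, each chunk boils down to a single statement shaped like the 
sketch below (the timestamp, data_source_id and id list are illustrative 
placeholders, not my real values):

  -- Nightly refresh, one chunk: stamp a batch of active adverts as still
  -- observed. All literal values here are placeholders.
  UPDATE "adverts"
  SET "last_observed_at" = '2012-11-28 03:00:00',
      "data_source_id"   = 1
  WHERE "adverts"."id" IN (101, 102, 103);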

A compressed version of my .conf looks like this (note: there are some tweaks 
at the end of the file):
  data_directory = '/var/lib/postgresql/9.2/main'
  hba_file = '/etc/postgresql/9.2/main/pg_hba.conf' 
  ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
  external_pid_file = '/var/run/postgresql/9.2-main.pid' 
  listen_addresses = '192.168.0.2, localhost'
  port = 5432
  max_connections = 1000 
  unix_socket_directory = '/var/run/postgresql'
  wal_level = hot_standby
  synchronous_commit = off
  archive_mode = on
  archive_command = 'rsync -a %p postgres@192.168.0.4:/var/lib/postgresql/9.2/wals/%f </dev/null'
  max_wal_senders = 1 
  wal_keep_segments = 32
  logging_collector = on 
  log_min_messages = debug1 
  log_min_error_statement = debug1
  log_min_duration_statement = 0
  log_checkpoints = on
  log_connections = on
  log_disconnections = on
  log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
  log_lock_waits = on
  log_temp_files = 0
  datestyle = 'iso, mdy' 
  lc_messages = 'C'
  lc_monetary = 'en_US.UTF-8'
  lc_numeric = 'en_US.UTF-8' 
  lc_time = 'en_US.UTF-8' 
  default_text_search_config = 'pg_catalog.english' 
  default_statistics_target = 100
  maintenance_work_mem = 1GB
  checkpoint_completion_target = 0.7
  effective_cache_size = 22GB
  work_mem = 160MB
  wal_buffers = 4MB
  checkpoint_segments = 16
  shared_buffers = 7680MB

# All the log stuff is mainly a temporary requirement for pgBadger
# The database has been tuned with pgtuner
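
As a sanity check on the checkpoint settings, a plain catalog query like the 
one below shows whether checkpoints are being forced by WAL volume before 
checkpoint_timeout expires (nothing here is specific to my schema):

  -- If checkpoints_req dwarfs checkpoints_timed, checkpoint_segments is too
  -- low and checkpoints are triggered by WAL volume, causing I/O spikes.
  SELECT checkpoints_timed, checkpoints_req,
         buffers_checkpoint, buffers_backend
  FROM pg_stat_bgwriter;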

You might be familiar with New Relic, and I use that for quite a lot of 
monitoring. This is what I see at night time (a lot of I/O), so I went to 
play around with pgBadger to get some insights at the database level:
https://rpm.newrelic.com/public/charts/h2dtedghfsv (New Relic chart of the nightly I/O)

This shows me that by far the most time-consuming queries are updates (in 
general). On average, a query like the one I showed you takes 1.3 s, but it 
often takes several minutes - which makes me wonder. So correct me if I'm 
wrong here: my theory is that I have too many slow update queries, which then 
often end up waiting for each other to finish, hence the sometimes VERY long 
execution times. So my basic idea is that if I could reduce the cost of the 
updates, I could get higher throughput overall.
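
To test the "waiting for each other" theory directly, a sketch like this 
against the standard pg_locks and pg_stat_activity catalogs (valid on 9.2, 
where pg_stat_activity has pid and query columns) pairs each blocked backend 
with the session blocking it:

  -- For every backend waiting on a lock, find a backend holding a
  -- conflicting (granted) lock on the same object or transaction id.
  SELECT bl.pid   AS blocked_pid,
         wa.query AS blocked_query,
         kl.pid   AS blocking_pid,
         ka.query AS blocking_query
  FROM pg_locks bl
  JOIN pg_stat_activity wa ON wa.pid = bl.pid
  JOIN pg_locks kl
    ON  kl.locktype = bl.locktype
    AND kl.database      IS NOT DISTINCT FROM bl.database
    AND kl.relation      IS NOT DISTINCT FROM bl.relation
    AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
    AND kl.granted
    AND kl.pid <> bl.pid
  JOIN pg_stat_activity ka ON ka.pid = kl.pid
  WHERE NOT bl.granted;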

Here is a sample of the pgBadger analysis:

Queries that took up the most time

Rank  Total duration   Times executed  Avg. duration  Query
1     1d15h28m38.71s        948,711    0.15s
      COMMIT;

2     1d2h17m55.43s         401,002    0.24s
      INSERT INTO "car_images" ( "car_id", "created_at", "image", "updated_at" )
      VALUES ( '', '', '', '' ) returning "id";

3     23h18m33.68s          195,093    0.43s
      SELECT DISTINCT "cars".id FROM "cars"
      LEFT OUTER JOIN "adverts" ON "adverts"."car_id" = "cars"."id"
      LEFT OUTER JOIN "sellers" ON "sellers"."id" = "adverts"."seller_id"
      WHERE "cars"."sales_state" = '' AND "cars"."year" = 0
        AND "cars"."engine_size" = 0.0
        AND ( ( "cars"."id" IS NOT NULL AND "cars"."brand" = ''
          AND "cars"."model_name" = '' AND "cars"."fuel" = ''
          AND "cars"."km" = 0 AND "cars"."price" = 0
          AND "sellers"."kind" = '' ) )
      LIMIT 0;

4     22h45m26.52s        3,374,133    0.02s
      SELECT "adverts".* FROM "adverts"
      WHERE ( source_name = '' AND md5 ( url ) = md5 ( '' ) ) LIMIT 0;

5     10h31m37.18s           29,671    1.28s
      UPDATE "adverts" SET "last_observed_at" = '', "data_source_id" = ''
      WHERE "adverts"."id" IN ( ... );

6     7h18m40.65s           396,393    0.07s
      UPDATE "cars" SET "updated_at" = '' WHERE "cars"."id" = 0;

7     7h6m7.87s             241,294    0.11s
      UPDATE "cars" SET "images_count" = COALESCE ( "images_count", 0 ) + 0
      WHERE "cars"."id" = 0;

8     6h56m11.78s            84,571    0.30s
      INSERT INTO "failed_adverts" ( "active_record_object_class",
        "advert_candidate", "created_at", "exception_class",
        "exception_message", "from_rescraper", "last_retried_at",
        "retry_count", "source_name", "stack_trace", "updated_at", "url" )
      VALUES ( NULL, '', '', '', '', NULL, NULL, '', '', '', '', '' )
      returning "id";

9     5h47m25.45s           188,402    0.11s
      INSERT INTO "adverts" ( "availability_state", "car_id", "created_at",
        "data_source_id", "deactivated_at", "first_extraction",
        "last_observed_at", "price", "seller_id", "source_id", "source_name",
        "updated_at", "url" )
      VALUES ( '', '', '', '', NULL, '', '', '', '', '', '', '', '' )
      returning "id";

10    3h4m26.86s            166,235    0.07s
      UPDATE "adverts" SET "deactivated_at" = '', "availability_state" = '',
        "updated_at" = '' WHERE "adverts"."id" = 0;

(Yes, I'm already working on improving the SELECTs.)
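
One thing I plan to check for the update cost (ranks 5, 6, 7 and 10 above) is 
whether those updates are HOT, i.e. avoid touching the indexes; a low HOT 
ratio on a heavily-updated table would suggest reserving some free space per 
page. A sketch below - the fillfactor value is just an example, not something 
I've applied:

  -- Ratio of HOT updates to total updates on the hot tables.
  SELECT relname, n_tup_upd, n_tup_hot_upd
  FROM pg_stat_user_tables
  WHERE relname IN ('cars', 'adverts');

  -- Hypothetical tweak: keep ~10% of each page free so updated row
  -- versions can stay on the same page (applies to newly written pages).
  ALTER TABLE "cars" SET (fillfactor = 90);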

On 28/11/2012, at 16:11, Willem Leenen <willem_lee...@hotmail.com> wrote:

> 
> I assume that SQL databases (Banks? Telecom?) can handle a used car shop. 
> No need for an unstructured data tool.
> 
> 
> 
> > +1, SQL databases have a limited number of inserts/updates per second. Even
> > with high-end hardware you won't get more than XXX operations per
> > second. As Thomas said, you should feed something like a NoSQL database
> > from the www server and use another tool to do aggregation and batch
> > inserts into PostgreSQL. It will scale much better.
> > 
> > Marcin
