On Apr 22, 2004, at 7:30 PM, Sean Shanny wrote:
I should have included this as well:
show all;

              name               |    setting
---------------------------------+----------------
 add_missing_from                | on
 australian_timezones            | off
 authentication_timeout          | 60
 check_function_bodies           | on
 checkpoint_segments             | 64
 checkpoint_timeout              | 30
 checkpoint_warning              | 30
 client_encoding                 | UNICODE
 client_min_messages             | notice
 commit_delay                    | 0
 commit_siblings                 | 5
 cpu_index_tuple_cost            | 0.001
 cpu_operator_cost               | 0.0025
 cpu_tuple_cost                  | 0.01
 DateStyle                       | ISO, MDY
 db_user_namespace               | off
 deadlock_timeout                | 1000
 debug_pretty_print              | off
 debug_print_parse               | off
 debug_print_plan                | off
 debug_print_rewritten           | off
 default_statistics_target       | 1000
 default_transaction_isolation   | read committed
 default_transaction_read_only   | off
 dynamic_library_path            | $libdir
 effective_cache_size            | 400000
 enable_hashagg                  | on
 enable_hashjoin                 | on
 enable_indexscan                | on
 enable_mergejoin                | on
 enable_nestloop                 | on
 enable_seqscan                  | on
 enable_sort                     | on
 enable_tidscan                  | on
 explain_pretty_print            | on
 extra_float_digits              | 0
 from_collapse_limit             | 8
 fsync                           | on
 geqo                            | on
 geqo_effort                     | 1
 geqo_generations                | 0
 geqo_pool_size                  | 0
 geqo_selection_bias             | 2
 geqo_threshold                  | 11
 join_collapse_limit             | 8
 krb_server_keyfile              | unset
 lc_collate                      | C
 lc_ctype                        | C
 lc_messages                     | C
 lc_monetary                     | C
 lc_numeric                      | C
 lc_time                         | C
 log_connections                 | off
 log_duration                    | off
 log_error_verbosity             | default
 log_executor_stats              | off
 log_hostname                    | off
 log_min_duration_statement      | -1
 log_min_error_statement         | panic
 log_min_messages                | notice
 log_parser_stats                | off
 log_pid                         | off
 log_planner_stats               | off
 log_source_port                 | off
 log_statement                   | off
 log_statement_stats             | off
 log_timestamp                   | on
 max_connections                 | 100
 max_expr_depth                  | 10000
 max_files_per_process           | 1000
 max_fsm_pages                   | 20000
 max_fsm_relations               | 1000
 max_locks_per_transaction       | 64
 password_encryption             | on
 port                            | 5432
 pre_auth_delay                  | 0
 preload_libraries               | unset
 random_page_cost                | 4
 regex_flavor                    | advanced
 rendezvous_name                 | unset
 search_path                     | $user,public
 server_encoding                 | UNICODE
 server_version                  | 7.4.1
 shared_buffers                  | 4000
 silent_mode                     | off
 sort_mem                        | 64000
 sql_inheritance                 | on
 ssl                             | off
 statement_timeout               | 0
 stats_block_level               | on
 stats_command_string            | on
 stats_reset_on_server_start     | off
 stats_row_level                 | on
 stats_start_collector           | on
 superuser_reserved_connections  | 2
 syslog                          | 0
 syslog_facility                 | LOCAL0
 syslog_ident                    | postgres
 tcpip_socket                    | on
 TimeZone                        | unknown
 trace_notify                    | off
 transaction_isolation           | read committed
 transaction_read_only           | off
 transform_null_equals           | off
 unix_socket_directory           | unset
 unix_socket_group               | unset
 unix_socket_permissions         | 511
 vacuum_mem                      | 64000
 virtual_host                    | unset
 wal_buffers                     | 1024
 wal_debug                       | 0
 wal_sync_method                 | open_sync
 zero_damaged_pages              | off
Sean Shanny wrote:
To all,
Essentials: Running 7.4.1 on OSX on a loaded G5 with dual procs, 8GB memory, direct attached via fibre channel to a fully optioned 3.5TB XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
Background: We are loading what are essentially XML-based access logs from 20+ webservers daily, about 6GB of raw data. We have a classic star schema. All the ETL tools are custom Java code or standard *nix tools like sort, uniq, etc.
The problem: We have about 46 million rows in a table with the following schema:
          Table "public.d_referral"
       Column       |  Type   | Modifiers
--------------------+---------+-----------
 id                 | integer | not null
 referral_raw_url   | text    | not null
 job_control_number | integer | not null
Indexes:
    "d_referral_pkey" primary key, btree (id)
    "idx_referral_url" btree (referral_raw_url)
This is one of our dimension tables. Part of the daily ETL process is to match all the new referral URLs against the existing data in the d_referral table. Some of the values in referral_raw_url can be 5000 characters long :-( ; the average length is 109.57 characters.
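One idea worth noting here (an illustrative sketch, not something the post above does): since the join key can run to 5000 characters, matching could instead be done on a fixed-length digest of each URL. Equal URLs always hash to the same digest, so an equality join on the digest behaves like a join on the raw string, modulo an astronomically unlikely collision. The function name below is hypothetical:

```python
import hashlib

def url_digest(url: str) -> str:
    """Return a fixed-length MD5 hex digest to use as a join key
    in place of a potentially 5000-character URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# A 5000+ character URL still yields a 32-character key, which keeps
# hash-table entries and comparisons small during the join.
long_url = "http://example.com/?" + "x" * 5000
digest = url_digest(long_url)
print(len(digest))  # 32
```

A digest column would of course have to be populated and indexed alongside the raw URL, and the raw URL kept for the rare collision check.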
I sort and uniq all the incoming referrals and load them into a temp table.
 Table "public.referral_temp"
 Column | Type | Modifiers
--------+------+-----------
 url    | text | not null
Indexes:
    "referral_temp_pkey" primary key, btree (url)
I then do a left join:
SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
This is the output from an explain analyze. (Please note that I do a set enable_indexscan = false; prior to issuing this, because it takes forever using index scans.)
explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=4012064.81..4013194.45 rows=451856 width=115) (actual time=1297320.823..1297739.813 rows=476176 loops=1)
Sort Key: t1.id
-> Hash Left Join (cost=1052345.95..3969623.10 rows=451856 width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
Hash Cond: ("outer".url = "inner".referral_raw_url)
-> Seq Scan on referral_temp t2 (cost=0.00..6645.56 rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
-> Hash (cost=729338.16..729338.16 rows=46034716 width=124) (actual time=1146440.710..1146440.710 rows=0 loops=1)
-> Seq Scan on d_referral t1 (cost=0.00..729338.16 rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715 loops=1)
Total runtime: 1298153.193 ms
(8 rows)
What I would like to know is whether there are better ways to do the join. I need to get all the rows back from the referral_temp table, as they are used for assigning FKs for the fact table later in processing. When I iterate over the returned values, those with t1.id = null get assigned a new FK and are pushed into the d_referral table as new entries, as well as written to a text file for later use. The matching records are written to a text file for later use.
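The post-join bookkeeping described above can be sketched as follows. This is a hypothetical illustration of the logic (the actual ETL is custom Java code; the names here are invented): rows come back from the left join as (id, url) pairs, and a null id means the URL is new and gets the next surrogate key.

```python
def split_referrals(rows, next_id):
    """Partition LEFT JOIN output into matched and new referrals.

    rows    -- iterable of (id_or_None, url) pairs from the join
    next_id -- first unused surrogate key
    Returns (matched, new) lists of (id, url) pairs.
    """
    matched, new = [], []
    for rid, url in rows:
        if rid is None:                  # no hit in d_referral: new URL
            new.append((next_id, url))   # assign the next surrogate key
            next_id += 1
        else:
            matched.append((rid, url))
    return matched, new

rows = [(17, "http://a.example/"), (None, "http://b.example/"),
        (None, "http://c.example/")]
matched, new = split_referrals(rows, next_id=1000)
print(matched)  # [(17, 'http://a.example/')]
print(new)      # [(1000, 'http://b.example/'), (1001, 'http://c.example/')]
```

The matched list corresponds to the text file of existing FKs; the new list is what gets inserted into d_referral and written out as well.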
If we cannot improve the join performance, my question becomes: are there better tools for matching up the 46 million (and growing at a rate of 1 million every 3 days) strings outside of PostgreSQL? We don't want to have to invest zillions of dollars in hardware, but if we have to, we will. I just want to make sure we have covered all the non-hardware possibilities for improvement before we start investing in large disk arrays.
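On the outside-the-database option: since the incoming URLs are already sorted and uniq'd, and the existing URL set could be dumped in sorted order, a single linear merge pass over the two sorted streams separates known strings from new ones with no random I/O at all. A minimal sketch of that merge, assuming both inputs are sorted and de-duplicated (hypothetical names, in-memory lists standing in for the on-disk files):

```python
def merge_match(existing, incoming):
    """Single pass over two sorted, de-duplicated string lists.
    Returns (known, new): incoming URLs that do / do not appear
    in existing."""
    known, new = [], []
    i = 0
    for url in incoming:
        # Advance the existing-side cursor to the first entry >= url;
        # because both inputs are sorted, the cursor never moves back.
        while i < len(existing) and existing[i] < url:
            i += 1
        if i < len(existing) and existing[i] == url:
            known.append(url)
        else:
            new.append(url)
    return known, new

existing = ["http://a/", "http://c/", "http://e/"]
incoming = ["http://b/", "http://c/", "http://f/"]
known, new = merge_match(existing, incoming)
print(known)  # ['http://c/']
print(new)    # ['http://b/', 'http://f/']
```

The same pass over sorted files is O(n + m) and sequential, which is the access pattern disk-based sort/merge tools are already good at.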
Thanks.
--sean
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match