Re: [GENERAL] Splitting Postgres into Separate Clusters?

2013-03-11 Thread Ben Chobot
On Mar 9, 2013, at 11:54 AM, Paul Jungwirth wrote:

 Hello,
 
 I'm running a specialized search engine that indexes a few tens of millions 
 of web pages, keeping everything in Postgres, and one problem I'm starting to 
 see is poor cache hit rates. My database has two or three tables just for the 
 text of the scraped pages, with one row every time a page is scraped and a 
 `text` column for the HTML. These tables are almost-but-not-quite write-only. 
 They are only read by one job, which uses them to create other objects in the 
 system. I'd like the rest of the database to be in-memory all the time, but I 
 don't really care if I have to read these tables from disk. To keep my 
 problem tables from dragging down performance on the rest of the system, I'm 
 thinking of splitting them out. I suppose I could save the HTML outside of 
 Postgres entirely, but I'm wondering if a nicer solution would be to keep 
 these tables in a separate cluster (i.e. 
 /var/lib/postgresql/9.1/something_else -- I hope this is the right 
 terminology). Then I could tune that cluster differently from the main 
 cluster, or even put it on a different machine. And I could still use dblink 
 to query both clusters conveniently (I think; this isn't really that 
 important). Does this seem like a worthwhile approach? Is there a better way 
 to deal with a few out-sized tables?

Well, what problem exactly are you trying to solve? Having large tables itself 
isn't a problem, but it often tends to imply other things that might be 
problematic:

- large vacuum or analyze times. (Probably fixable in your case with per-table 
autovacuum thresholds; see the rough sketch after this list.)
- large disk usage. (Probably fixable in your case with tablespaces; also 
sketched below.)
- slow selects or updates. (Probably fixable in your case with partitioning.)
- blowing out your buffer cache with useless dirty pages - this is where you 
might want to look into separate servers.
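
Something along these lines, for instance - the table and tablespace names are 
just placeholders and the numbers purely illustrative:

-- make autovacuum fire more often on just the big table,
-- rather than lowering the thresholds for the whole cluster
ALTER TABLE scraped_pages
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_analyze_scale_factor = 0.02);

-- move the big table onto its own disk
-- (note that SET TABLESPACE rewrites the table and locks it while it runs)
CREATE TABLESPACE scrape_space LOCATION '/mnt/bulk_disk/pg_scrape';
ALTER TABLE scraped_pages SET TABLESPACE scrape_space;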

Re: [GENERAL] Splitting Postgres into Separate Clusters?

2013-03-11 Thread Paul Jungwirth
 Well, what problem exactly are you trying to solve?
 Having large tables itself isn't a problem, but it often
 tends to imply other things that might be problematic:

I'm trying to troubleshoot a very low cache hit rate as returned by this query:

SELECT sum(heap_blks_read) as heap_read,
       sum(heap_blks_hit)  as heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM   pg_statio_user_tables;
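
For what it's worth, a per-table version of the same query (untested; the
NULLIF just guards against division by zero on empty tables) should show which
relations are pulling the ratio down:

SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric /
             NULLIF(heap_blks_hit + heap_blks_read, 0), 3) as ratio
FROM   pg_statio_user_tables
ORDER  BY heap_blks_read DESC
LIMIT  20;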

So I think that's your fourth option:

 - blowing out your buffer cache with useless dirty pages - this is where you 
 might want to look into separate servers.

It sounds like I'm on the right track, but a separate cluster/server seems
like a drastic solution.
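
Before going that far, I suppose I could confirm what is actually sitting in
shared_buffers with the pg_buffercache contrib module - assuming I can install
it; the query below is just the usual buffers-per-relation breakdown:

CREATE EXTENSION pg_buffercache;

SELECT c.relname,
       count(*) as buffers
FROM   pg_buffercache b
JOIN   pg_class c ON c.relfilenode = b.relfilenode
JOIN   pg_database d ON d.oid = b.reldatabase
WHERE  d.datname = current_database()
GROUP  BY c.relname
ORDER  BY buffers DESC
LIMIT  20;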

Thanks,
Paul


--
_
Pulchritudo splendor veritatis.




[GENERAL] Splitting Postgres into Separate Clusters?

2013-03-09 Thread Paul Jungwirth
Hello,

I'm running a specialized search engine that indexes a few tens of millions
of web pages, keeping everything in Postgres, and one problem I'm starting
to see is poor cache hit rates. My database has two or three tables just
for the text of the scraped pages, with one row every time a page is
scraped and a `text` column for the HTML. These tables are
almost-but-not-quite write-only. They are only read by one job, which uses
them to create other objects in the system. I'd like the rest of the
database to be in-memory all the time, but I don't really care if I have to
read these tables from disk. To keep my problem tables from dragging down
performance on the rest of the system, I'm thinking of splitting them out.
I suppose I could save the HTML outside of Postgres entirely, but I'm
wondering if a nicer solution would be to keep these tables in a separate
cluster (i.e. /var/lib/postgresql/9.1/something_else -- I hope this is
the right terminology). Then I could tune that cluster differently from the
main cluster, or even put it on a different machine. And I could still use
dblink to query both clusters conveniently (I think; this isn't really that
important). Does this seem like a worthwhile approach? Is there a better
way to deal with a few out-sized tables?
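
If it matters, I'm picturing the cross-cluster access looking roughly like this
(connection string, table, and column names are all made up):

-- run from the main cluster, with the dblink extension installed there
CREATE EXTENSION dblink;

SELECT page_id, html
FROM   dblink('host=localhost port=5433 dbname=scrapes',
              'SELECT page_id, html FROM scraped_pages WHERE page_id = 42')
       as t(page_id integer, html text);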

Thanks,
Paul

-- 
_
Pulchritudo splendor veritatis.