On Mar 9, 2013, at 11:54 AM, Paul Jungwirth wrote:

> Hello,
>
> I'm running a specialized search engine that indexes a few tens of millions
> of web pages, keeping everything in Postgres, and one problem I'm starting to
> see is poor cache hit rates. My database has two or three tables just for the
> text of the scraped pages, with one row every time a page is scraped and a
> `text` column for the HTML. These tables are almost-but-not-quite write-only.
> They are only read by one job, which uses them to create other objects in the
> system. I'd like the rest of the database to be in-memory all the time, but I
> don't really care if I have to read these tables from disk. To keep my
> problem tables from dragging down performance on the rest of the system, I'm
> thinking of splitting them out. I suppose I could save the HTML outside of
> Postgres entirely, but I'm wondering if a nicer solution would be to keep
> these tables in a separate "cluster" (i.e.
> /var/lib/postgresql/9.1/something_else -- I hope this is the right
> terminology). Then I could tune that cluster differently from the main
> cluster, or even put it on a different machine. And I could still use dblink
> to query both clusters conveniently (I think; this isn't really that
> important). Does this seem like a worthwhile approach? Is there a better way
> to deal with a few out-sized tables?
Well, what problem exactly are you trying to solve? Having large tables
isn't itself a problem, but it often tends to imply other things that might
be problematic:

- large vacuum or analyze times. (Probably fixable in your case with
  per-table autovacuum thresholds; see the sketch after this list.)
- large disk usage. (Probably fixable in your case with tablespaces.)
- slow selects or updates. (Probably fixable in your case with partitioning.)
- blowing out your buffer cache with useless dirty pages. This is where you
  might want to look into separate servers.
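For the first two items, the DDL is fairly mechanical. A minimal sketch,
assuming a hypothetical table named scraped_pages and a made-up mount point
(both are placeholders, not names from your schema):

```sql
-- Relax autovacuum for the big append-mostly table so it is not
-- vacuumed/analyzed on the cluster-wide default schedule.
-- (Table name and thresholds are illustrative only.)
ALTER TABLE scraped_pages SET (
    autovacuum_vacuum_scale_factor  = 0.0,
    autovacuum_vacuum_threshold     = 500000,
    autovacuum_analyze_scale_factor = 0.0,
    autovacuum_analyze_threshold    = 500000
);

-- Move the bulky table onto its own disk with a tablespace.
-- The directory must already exist and be owned by the postgres user.
CREATE TABLESPACE scrape_space LOCATION '/mnt/bulk_disk/pg_tblspc';
ALTER TABLE scraped_pages SET TABLESPACE scrape_space;
```

Note that ALTER TABLE ... SET TABLESPACE physically copies the table's files
and holds an exclusive lock while it runs, so on a table that size it will
take a while.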