On Mar 9, 2013, at 11:54 AM, Paul Jungwirth wrote:

> Hello,
> 
> I'm running a specialized search engine that indexes a few tens of millions 
> of web pages, keeping everything in Postgres, and one problem I'm starting to 
> see is poor cache hit rates. My database has two or three tables just for the 
> text of the scraped pages, with one row every time a page is scraped and a 
> `text` column for the HTML. These tables are almost-but-not-quite write-only. 
> They are only read by one job, which uses them to create other objects in the 
> system. I'd like the rest of the database to be in-memory all the time, but I 
> don't really care if I have to read these tables from disk. To keep my 
> problem tables from dragging down performance on the rest of the system, I'm 
> thinking of splitting them out. I suppose I could save the HTML outside of 
> Postgres entirely, but I'm wondering if a nicer solution would be to keep 
> these tables in a separate "cluster" (i.e. 
> /var/lib/postgresql/9.1/something_else -- I hope this is the right 
> terminology). Then I could tune that cluster differently from the main 
> cluster, or even put it on a different machine. And I could still use dblink 
> to query both clusters conveniently (I think; this isn't really that 
> important). Does this seem like a worthwhile approach? Is there a better way 
> to deal with a few out-sized tables?

Well, what problem exactly are you trying to solve? Having large tables isn't a 
problem in itself, but it often implies other things that can be:

- large vacuum or analyze times. (Probably fixable in your case with per-table 
autovacuum thresholds; first sketch below.)
- large disk usage. (Probably fixable in your case with tablespaces; second 
sketch below.)
- slow selects or updates. (Probably fixable in your case with partitioning; 
third sketch below.)
- blowing out your buffer cache with useless dirty pages - this is where you 
might want to look into separate servers, or the separate cluster and dblink 
setup you describe (last sketch below).
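
For the per-table autovacuum thresholds, something like this should do it (the 
table name is just a stand-in for yours; tune the numbers to your write rate):

    -- trigger vacuum/analyze after a fixed number of changed rows
    -- instead of a fraction of an already-huge table
    ALTER TABLE scraped_pages SET (
        autovacuum_vacuum_scale_factor  = 0.0,
        autovacuum_vacuum_threshold     = 100000,
        autovacuum_analyze_scale_factor = 0.0,
        autovacuum_analyze_threshold    = 100000
    );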
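
For disk usage, a tablespace on a separate (bigger, cheaper) volume looks like 
this; path and names are made up, the directory has to exist and be owned by 
the postgres OS user, and SET TABLESPACE rewrites the table under an exclusive 
lock:

    CREATE TABLESPACE scrape_space LOCATION '/mnt/bigdisk/pg_scrapes';
    ALTER TABLE scraped_pages SET TABLESPACE scrape_space;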
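
For partitioning on 9.1 you'd use inheritance plus CHECK constraints, roughly 
as below. Column names are guesses at your schema, and inserts have to be 
routed to the right child by a trigger or by the application:

    -- parent stays empty if inserts go to the children
    CREATE TABLE page_html (
        page_id    bigint,
        scraped_at timestamptz NOT NULL,
        html       text
    );

    -- one child per month; with constraint_exclusion = partition (the default)
    -- the planner skips children whose CHECK can't match the query
    CREATE TABLE page_html_2013_03 (
        CHECK (scraped_at >= '2013-03-01' AND scraped_at < '2013-04-01')
    ) INHERITS (page_html);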
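
And if you do end up pushing these tables out to their own cluster or server, 
dblink works about the way you'd hope; the connection string and names here 
are placeholders:

    CREATE EXTENSION dblink;  -- ships in contrib

    SELECT html
      FROM dblink('host=scrape-db dbname=scrapes',
                  'SELECT html FROM page_html WHERE page_id = 42')
           AS t(html text);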
