> Our approach is that "money" should never be an issue. If you can "prove"
> us that polling is cheaper for you, then : 1- we'll match that, 2- we want
> to learn how :)
I'm currently doing it for 250K feeds on a Unix box with 8GB RAM. I use the SimplePie PHP library to parse the feeds, and it does a decent job of isolating me from the oddities of each format.

I have a PHP file that picks a few hundred feeds at a time from the db and fetches them in a loop. I run this PHP file from a cron job every 10 minutes, but I run 50 instances of it at a time (my crontab file has 50 copies of the line that runs that PHP file). That's my poor man's approach to multi-threading, but it works.

Each feed has an integer ID, and I combine the ID modulo and the time modulo in a clever equation to decide which feeds to pull at each point in time. This means I don't have to keep track of the last time a feed was pulled, which saves quite a bit of db access.

It's not a very sophisticated setup, but it works. Effectively, the ongoing cost is the cost of renting an 8GB box (about $200/month).

I'm reaching the limits of what I can do on one box, though. Not because of the polling process itself, but mostly because of the size of the data and the disk swapping that comes with it. So I can either get a bigger box, shard my data, or move to App Engine. I prefer App Engine because I've had good experience scaling other projects on it in the recent past, and because I want to solve the scaling problem once and for all rather than delay it.
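
For anyone curious about the ID/time modulo trick, here is a rough sketch of one way it could work. This is not my actual equation, and it is in Python rather than PHP just to keep it short. The names `SLOTS_PER_CYCLE`, `current_slot` and `is_due` are made up for illustration, and the one-hour polling cycle (6 slots of 10 minutes) is an assumption. Only the 10-minute cron interval and the 50 parallel workers come from the setup described above.

```python
CRON_INTERVAL_SECONDS = 600  # cron fires every 10 minutes (from the setup above)
SLOTS_PER_CYCLE = 6          # assumption: each feed is polled once per hour
WORKERS = 50                 # 50 parallel copies of the script (from the setup above)


def current_slot(now: float) -> int:
    """Map a Unix timestamp to a slot number in 0..SLOTS_PER_CYCLE-1."""
    return int(now // CRON_INTERVAL_SECONDS) % SLOTS_PER_CYCLE


def is_due(feed_id: int, worker: int, now: float) -> bool:
    """Decide whether this worker should fetch this feed in the current run.

    The feed's slot comes from its ID modulo the number of slots, and the
    worker that owns it comes from the remaining part of the ID. No
    "last fetched" timestamp is read or written.
    """
    in_slot = feed_id % SLOTS_PER_CYCLE == current_slot(now)
    mine = (feed_id // SLOTS_PER_CYCLE) % WORKERS == worker
    return in_slot and mine
```

Each worker would call something like `is_due(feed_id, worker, time.time())` for its candidate feeds, or push the same two modulo conditions into the query that selects feeds. Under these assumptions every feed is picked by exactly one worker once per cycle, which is what removes the need to store a per-feed timestamp.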
