On Sat, Oct 2, 2010 at 9:03 PM, Russell Dias <rus...@gmail.com> wrote:
> I'm currently stuck on a little problem. I'm using cURL in conjunction
> with DOMDocument and XPath to scrape data from a couple of websites.
> Please note that this is only for personal and educational purposes.
>
> Right now I have 5 independent scripts (that traverse 5 websites)
> which run via a cron tab every 12 hours. However, as you may have
> guessed, this is a scalability nightmare. If my list of websites to
> scrape grows, I have to create another independent script and run it
> via cron.
>
> My knowledge of OOP is fairly basic, as I have just gotten started
> with it. However, could anyone perhaps suggest a design pattern that
> would suit my needs? My solution would be to create an abstract class
> for the web crawler and then simply extend it for each website I add.
> However, as I said, my experience with OOP is almost non-existent, so
> I have no idea how this would scale. I want this 'crawler' to be one
> application which can run via one cron job, rather than having n
> scripts for n websites and having to manually create a cron entry
> each time.
>
> Or does anyone have experience with this sort of thing and could
> maybe offer some advice?
>
> I'm not limited to using PHP either; however, due to hosting
> constraints, Python would most likely be my only other alternative.
>
> Any help would be appreciated.
>
> Cheers,
> Russell
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php

Are the sites you are crawling so different as to justify maintaining separate chunks of code for each one? I would try to avoid having any code specific to a site; otherwise, scaling your application to support even a hundred sites would mean maintaining hundreds of overlapping points of functionality and become a logistical nightmare. Unless you're simply wanting to do this for educational reasons...
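For reference, the abstract-class design Russell describes would look roughly like this. The sketch is in Python (which he names as his alternative); every class and method name is hypothetical, and the fetch step is stubbed out so only the structure is shown:

```python
# Hypothetical sketch of the "one base class, one subclass per site" idea.
# Shared plumbing (fetching, scheduling, storage) lives in the base class;
# each subclass carries only its site-specific extraction rules.
from abc import ABC, abstractmethod


class Crawler(ABC):
    """Base class holding the logic common to every site."""

    def run(self, html: str) -> dict:
        # A real crawler would fetch the page here (cURL/urllib);
        # accepting the markup directly keeps the sketch self-contained.
        return self.parse(html)

    @abstractmethod
    def parse(self, html: str) -> dict:
        """Each site subclass implements only its own scraping rules."""


class ExampleSiteCrawler(Crawler):
    """One subclass per site -- this is the part that multiplies."""

    def parse(self, html: str) -> dict:
        # Trivial title grab, standing in for real XPath extraction.
        start = html.find("<title>") + len("<title>")
        end = html.find("</title>")
        return {"title": html[start:end]}
```

The drawback, as noted below, is that every new site still means writing and maintaining a new subclass.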
My suggestion would be to attempt to create an application that can crawl all the sites, without specifics for each one. You could fire it with a single cron job and give it a list of the URLs you want it to hit. It can crawl one URL, record the findings, move to the next, and repeat.

Chris.
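A rough sketch of that loop, again in Python since Russell mentions it as an alternative. The URLs, XPath expressions, and the injected fetch function are all made-up illustrations; a real run would plug in cURL/urllib for fetching and add persistence for the findings:

```python
# Data-driven crawler: sites are described by configuration, not code.
# Adding a site means appending an entry to the list, not writing a script.
import xml.etree.ElementTree as ET

# Hypothetical site list: each entry pairs a URL with an XPath expression.
SITES = [
    {"url": "http://example.com/a", "xpath": ".//item/title"},
    {"url": "http://example.org/b", "xpath": ".//entry/name"},
]


def extract(document: str, xpath: str) -> list:
    """Return the text of every node matching xpath in one page."""
    root = ET.fromstring(document)
    return [node.text for node in root.findall(xpath)]


def crawl(sites, fetch):
    # fetch(url) -> markup string; injected so the loop itself has no
    # network code and a single cron entry can drive the whole list.
    results = {}
    for site in sites:
        results[site["url"]] = extract(fetch(site["url"]), site["xpath"])
    return results
```

Note that `xml.etree.ElementTree` only handles well-formed XML and a limited XPath subset; against real-world HTML you would swap in a tolerant parser, but the crawl-one, record, move-to-the-next shape stays the same.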