On Tue, Jul 12, 2011 at 8:33 AM, Conrad Taylor <[email protected]> wrote:
> On Tue, Jul 12, 2011 at 2:02 AM, aupayo <[email protected]> wrote:
>
>> Hi,
>>
>> I want to screen scrape information from some websites (I have
>> permission to do it).
>>
>> I am using the Mechanize plugin. The websites are different from each
>> other, so I need to write different RoR code to screen scrape each
>> website. There would be hundreds of different websites.
>>
>> Ok, the problem is that I don't know how to implement this in an
>> elegant and efficient way. My current quick and dirty solution is a
>> model that I call when I want to screen scrape a website:
>>
>> I call it like: Spider.crawl(website_id)
>>
>> It looks like:
>>
>> class Spider < ActiveRecord::Base
>>
>>   require 'mechanize'
>>
>>   def self.crawl(website_id)
>>
>>     if website_id == 1
>>       # Mechanize code for screen scraping website 1
>>     end
>>
>>     if website_id == 2
>>       # Mechanize code for screen scraping website 2
>>     end
>>
>>     .....
>>
>>   end
>>
>> end
>>
>>
>> How can I improve that?
>> Is there at least a way to put the code for each website in an
>> external file, so then I can call just the code I need? That way I
>> would avoid working with a model that has thousands of lines...
>>
>> Thanks for your help!
>>
>>
> Hi, you can define a base class which contains all the common information
> for all your sites. Then you can define a subclass for each site that
> inherits from the base class. For example,
>
> class Site
>
>   attr_accessor :name
>
>   # to_s must return a String, so build it rather than printing it.
>   def to_s
>     "using #{self.class}#to_s"
>   end
>
>   def crawl
>     puts "using #{self.class}#crawl"
>   end
>
> end
>
> class HerSite < Site
>   def crawl
>     puts "using #{self.class}#crawl version 1"
>   end
> end
>
> class HisSite < Site
>   def crawl
>     puts "using #{self.class}#crawl version 2"
>   end
> end
>
> Next, you can define a SiteFactory class for creating an instance of the
> given class which represents our site. Thus, this can be represented
> as follows:
>
> class SiteFactory
>
>   def create( site )
>     site.new
>   end
>
> end
>

The above class can be refactored to the following, so that create is a
class method and no SiteFactory instance is needed:

class SiteFactory

  def self.create( site )
    site.new
  end

end

>
> We can define our Spider class that has a single class method that takes an
> instance of a site and invokes its crawl instance method.
>
> class Spider
>
>   def self.crawl_site( site )
>     site.crawl
>   end
>
> end
>
> Putting it all together, we can crawl all of our sites by doing the
> following:
>
> site_factory = SiteFactory.new
>
> [ HerSite, HisSite ].each do | klass |
>   site = site_factory.create( klass )
>   Spider.crawl_site( site )
> end
>

Now, we can rewrite our calling routine to the following:

[ HerSite, HisSite ].each do | klass |
  site = SiteFactory.create( klass )
  Spider.crawl_site( site )
end

Enjoy,

-Conrad

ps: There's always something you missed after you click send.

>
> Finally, anytime you want to add a new site you just create a class that
> inherits from class Site and has a single instance method called crawl
> that describes its strategy for navigating the site. There's an easier way
> to obtain all the classes that inherit from class Site and I leave this as
> an exercise for you.
>
> Good luck,
>
> -Conrad
>
>
>> --
>>
>> You received this message because you are subscribed to the Google Groups
>> "Ruby on Rails: Talk" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/rubyonrails-talk?hl=en.
>>
>>
>
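
On the exercise Conrad leaves open (obtaining every class that inherits
from Site without listing them by hand), one possible approach is Ruby's
Class#inherited hook. The following is only a minimal sketch, not
necessarily what Conrad had in mind: the registry method is a hypothetical
name introduced here, and crawl returns a string instead of calling puts so
that the result is easy to inspect.

```ruby
# Sketch: Site records each subclass as it is defined, using the
# Class#inherited hook that Ruby invokes automatically.
class Site
  # Hypothetical helper (not in the original message): the list of
  # subclasses seen so far.
  def self.registry
    @registry ||= []
  end

  # Called by Ruby whenever a class inherits from Site (or from one of
  # its subclasses). Always store into Site's own registry.
  def self.inherited(subclass)
    super
    Site.registry << subclass
  end

  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# Crawl every registered site; no hand-written [ HerSite, HisSite ] list.
results = Site.registry.map { |klass| klass.new.crawl }
puts results
```

Each site class can live in its own file; the hook only fires once that
file has been loaded, so the files still have to be required before
Site.registry is read. On Ruby 3.1 or later, the built-in Class#subclasses
returns the direct subclasses without any hook.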

