On Tue, Jul 12, 2011 at 8:33 AM, Conrad Taylor <[email protected]> wrote:

> On Tue, Jul 12, 2011 at 2:02 AM, aupayo <[email protected]> wrote:
>
>> Hi,
>>
>> I want to screen scrape information from some websites (I have
>> permission to do it).
>>
>> I am using the Mechanize gem. The websites are different from each
>> other, so I need to write different Ruby code to screen scrape each
>> website. There could be hundreds of different websites.
>>
>> Ok, the problem is that I don't know how to implement this in an
>> elegant and efficient way. My current quick and dirty solution is a
>> model that I call when I want to screen scrape a website:
>>
>> I call it like: Spider.crawl(website_id)
>>
>> It looks like:
>>
>> require 'mechanize'
>>
>> class Spider < ActiveRecord::Base
>>
>>   # Class method, so it can be called as Spider.crawl(website_id)
>>   def self.crawl(website_id)
>>
>>     if website_id == 1
>>       # Mechanize code for screen scraping website 1
>>     end
>>
>>     if website_id == 2
>>       # Mechanize code for screen scraping website 2
>>     end
>>
>>     # ...
>>
>>   end
>>
>> end
>>
>>
>> How can I improve that?
>> Is there at least a way to put the code for each website in an
>> external file, so then I can call just the code I need? That way I
>> would avoid working with a model that has thousands of lines...
>>
>> Thanks for your help!
>>
>>
> Hi, you can define a base class which contains all the common information
> for all your sites.  Then you can define a subclass for each site that
> inherits from the base class.  For example,
>
> class Site
>
>   attr_accessor :name
>
>   def to_s
>     "using #{self.class}#to_s"   # to_s should return a String, not print one
>   end
>
>   def crawl
>     puts "using #{self.class}#crawl"
>   end
>
> end
>
> class HerSite < Site
>   def crawl
>     puts "using #{self.class}#crawl version 1"
>   end
> end
>
> class HisSite < Site
>   def crawl
>     puts "using #{self.class}#crawl version 2"
>   end
> end
>
> Next, you can define a SiteFactory class for creating an instance of the
> given class which represents our site.  Thus, this can be represented
> as follows:
>
> class SiteFactory
>
>   def create( site )
>     site.new
>   end
>
> end
>

The above class can be refactored to use a class method, so no factory
instance is needed:

class SiteFactory
  def self.create( site )
    site.new
  end
end


>
> We can define our Spider class with a single class method that takes an
> instance of a site and invokes its crawl instance method.
>
> class Spider
>
>   def self.crawl_site( site )
>     site.crawl
>   end
>
> end
>
> Putting it all together, we can crawl all of our sites by doing the
> following:
>
> site_factory = SiteFactory.new
>
> [ HerSite, HisSite ].each do | klass |
>   site = site_factory.create( klass )
>   Spider.crawl_site( site )
> end
>

Now, we can rewrite our calling routine to the following:

[ HerSite, HisSite ].each do | klass |
  site = SiteFactory.create( klass )
  Spider.crawl_site( site )
end
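
To connect this back to the original Spider.crawl(website_id) call, one
option is a lookup table from id to site class.  The following is only a
minimal sketch: the SITES constant, the ids 1 and 2, and the returned strings
are assumptions for illustration, and the real Mechanize code would go where
the placeholder comments are.  Each site class could live in its own file and
be loaded with require, which keeps any single file short.

```ruby
class Site
  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    # Placeholder: Mechanize code for website 1 would go here.
    "crawled website 1"
  end
end

class HisSite < Site
  def crawl
    # Placeholder: Mechanize code for website 2 would go here.
    "crawled website 2"
  end
end

class Spider
  # Hypothetical mapping from the website id to its site class.
  SITES = { 1 => HerSite, 2 => HisSite }.freeze

  def self.crawl(website_id)
    # fetch runs the block for an unknown id instead of returning nil.
    klass = SITES.fetch(website_id) do
      raise ArgumentError, "unknown website id: #{website_id}"
    end
    klass.new.crawl
  end
end
```

With this in place, Spider.crawl(1) dispatches to HerSite#crawl, and adding a
site means adding one class and one entry in the table.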

Enjoy,

-Conrad

ps:  There's always something you missed after you click send.


>
> Finally, anytime you want to add a new site you just create a class that
> inherits from class Site and has a single instance method called crawl that
> describes its strategy for navigating the site.  There's an easier way to
> obtain all the classes that inherit from class Site than listing them by
> hand, and I leave this as an exercise for you.
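
One possible approach to that exercise, as a minimal sketch: Ruby calls the
inherited hook on a class whenever a subclass of it is defined, so Site can
record its own subclasses.  The registry method name is an assumption made
for this example, and the crawl methods return strings instead of printing
so that the result is easy to inspect.

```ruby
class Site
  # Hypothetical registry holding every class that descends from Site.
  def self.registry
    @registry ||= []
  end

  # Ruby invokes this hook each time a class inherits from Site
  # (or from one of its subclasses).
  def self.inherited(subclass)
    super
    Site.registry << subclass
  end

  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# No hand-written list of classes is needed any more.
results = Site.registry.map { |klass| klass.new.crawl }
```

On Ruby 3.1 and later, Class#subclasses returns the direct subclasses without
any hook, which may be enough if the hierarchy is only one level deep.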
>
> Good luck,
>
> -Conrad
>
>
>> --
>>
>> You received this message because you are subscribed to the Google Groups
>> "Ruby on Rails: Talk" group.
>> To post to this group, send email to [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/rubyonrails-talk?hl=en.
>>
>>
>
