On Tue, Jul 12, 2011 at 2:02 AM, aupayo <[email protected]> wrote:

> Hi,
>
> I want to screen scrape information from some websites (I have
> permission to do it).
>
> I am using the Mechanize plugin. The websites are different from each
> other, so I need to write a different RoR code to screen scrape each
> website. There would be hundreds of different websites.
>
> Ok, the problem is that I don't know how to implement this in an
> elegant and efficient way. My current quick and dirty solution is a
> model that I call when I want to screen scrape a website:
>
> I call it like: Spider.crawl(website_id)
>
> It looks like:
>
> class Spider < ActiveRecord::Base
>
>  require 'mechanize'
>
>  def crawl(website_id)
>
>          if(website_id == 1)
>                 //Mechanize code for screen scraping website 1
>          end
>
>          if(website_id == 2)
>                 //Mechanize code for screen scraping website 2
>          end
>
>           .....
>
>   end
>
> end
>
>
> How can I improve that?
> Is there at least a way to put the code for each website in an
> external file, so then I can call just the code I need? That way I
> would avoid working with a model that has thousands of lines...
>
> Thanks for your help!
>
>
Hi, you can define a base class that holds everything your sites have in
common.  Then you can define a subclass for each site that inherits from the
base class.  For example,

class Site

  attr_accessor :name

  # to_s should return a String rather than print one
  def to_s
    "using #{self.class}#to_s"
  end

  def crawl
    puts "using #{self.class}#crawl"
  end

end

class HerSite < Site
  def crawl
    puts "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    puts "using #{self.class}#crawl version 2"
  end
end

Next, you can define a SiteFactory class that creates an instance of the
class representing a given site:

class SiteFactory

  def create( site )
    site.new
  end

end

We can define our Spider class with a single class method that takes an
instance of a site and invokes its crawl instance method.

class Spider

  def self.crawl_site( site )
    site.crawl
  end

end

Putting it all together, we can crawl all of our sites by doing the
following:

site_factory = SiteFactory.new

[ HerSite, HisSite ].each do | klass |
  site = site_factory.create( klass )
  Spider.crawl_site( site )
end
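
To tie this back to the Spider.crawl(website_id) call in the original
question, here is a minimal sketch that looks the class up by id instead of
using a chain of if statements. The SITE_CLASSES hash and the ids 1 and 2 are
assumptions for illustration, and the crawl methods return a string here
(rather than calling puts) only so the result is easy to inspect.

```ruby
# Self-contained sketch: map a numeric website_id to a Site subclass.
class Site
  def crawl
    "using #{self.class}#crawl"
  end
end

class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# Hypothetical mapping; the ids are illustrative only and would
# normally match whatever ids your websites table uses.
SITE_CLASSES = {
  1 => HerSite,
  2 => HisSite
}.freeze

class Spider
  # Look up the class for the given id and run its crawl strategy.
  def self.crawl(website_id)
    klass = SITE_CLASSES.fetch(website_id) do
      raise ArgumentError, "no site registered for id #{website_id}"
    end
    klass.new.crawl
  end
end

puts Spider.crawl(1)
puts Spider.crawl(2)
```

Adding a site then means writing one new class and adding one line to the
hash, and an unknown id fails loudly instead of silently doing nothing.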

Finally, any time you want to add a new site, you just create a class that
inherits from Site and has a single instance method called crawl that
describes its strategy for navigating the site.  There's an easier way to
obtain all the classes that inherit from Site, and I leave this as an
exercise for you.
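
One possible approach to that exercise, offered as a sketch rather than the
only answer, is Ruby's Class#inherited hook, which Ruby calls every time a
subclass is defined. The registry method name is an assumption made up for
this example, and the sites/ directory in the comment is hypothetical.

```ruby
# Self-contained sketch: let each Site subclass register itself.
class Site
  # Class-level list of every subclass defined so far
  # (hypothetical helper name, not part of the original code).
  def self.registry
    @registry ||= []
  end

  # Ruby invokes this hook whenever a class inherits from Site,
  # directly or through another subclass.
  def self.inherited(subclass)
    super
    Site.registry << subclass
  end

  def crawl
    "using #{self.class}#crawl"
  end
end

# Each site could live in its own file, loaded with something like:
#   Dir[File.join(__dir__, "sites", "*.rb")].sort.each { |f| require f }
# (the sites/ directory is an assumption for illustration).
class HerSite < Site
  def crawl
    "using #{self.class}#crawl version 1"
  end
end

class HisSite < Site
  def crawl
    "using #{self.class}#crawl version 2"
  end
end

# Crawl every registered site without listing the classes by hand.
Site.registry.each do |klass|
  puts klass.new.crawl
end
```

A class only shows up in the registry once its file has been loaded, so the
per-site files need to be required before you iterate.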

Good luck,

-Conrad


> --
> You received this message because you are subscribed to the Google Groups
> "Ruby on Rails: Talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/rubyonrails-talk?hl=en.
>
>
