Hey all! Just to preface, I am fairly new to RoR, and brand new to using hpricot.
I am using the following code to scrape this XPath: "/html/body/div/div[5]/div/div[2]/div[2]/div[2]" from this URL: "http://www.greatnonprofits.org/". Here is my code (taken from igvita.com's related blog post):

*************
require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://www.greatnonprofits.org/"
@response = ''

begin
  # open-uri RDoc: http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
  open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
       "From" => "[email protected]",
       "Referer" => "http://www.igvita.com/blog/") { |f|
    puts "Fetched document: #{f.base_uri}"
    puts "\t Content Type: #{f.content_type}\n"
    puts "\t Charset: #{f.charset}\n"
    puts "\t Content-Encoding: #{f.content_encoding}\n"
    puts "\t Last Modified: #{f.last_modified}\n\n"

    # Save the response body
    @response = f.read
  }

  # HPricot RDoc: http://code.whytheluckystiff.net/hpricot/
  doc = Hpricot(@response)

  # Retrieve content
  puts((doc/"/html/body/div/div[5]/div/div[2]/div[2]/div[2]").to_html)
rescue Exception => e
  print e, "\n"
end
***************

In my irb terminal, I get this:

***************
irb(main):031:0> load 'greatnonprofitsscraper.rb'
Fetched document: http://www.greatnonprofits.org/
     Content Type: text/html
     Charset: utf-8
     Content-Encoding:
     Last Modified: Tue Mar 31 23:43:52 -0700 2009

=> true
***************

The headers print, but nothing comes back for the XPath. Does anyone know why this is happening? The code works with other URLs/XPaths. Can anyone tell me why www.greatnonprofits.org is different?

Thanks a million! I am quite frustrated, and I appreciate any help!
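
One way to narrow this down is to find the first step of the absolute XPath that matches nothing. A path copied from a browser describes the DOM after the browser has repaired the markup and run any scripts, so it can differ from the raw HTML that open-uri fetches. Whether that is the cause for this site is not established here. The sketch below is only an illustration: it uses REXML from the standard library instead of Hpricot so that it runs without gems or network access, and both SAMPLE_HTML and first_failing_step are hypothetical names made up for the example.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a fetched page; the real markup of the
# site in question is not known here.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <div>
        <div>first</div>
        <div>second</div>
      </div>
    </body>
  </html>
HTML

# Hypothetical helper: returns the shortest prefix of an absolute XPath
# that matches no elements, or nil if the whole path matches.
def first_failing_step(doc, xpath)
  steps = xpath.split('/').reject(&:empty?)
  (1..steps.size).each do |n|
    prefix = '/' + steps.first(n).join('/')
    return prefix if REXML::XPath.match(doc, prefix).empty?
  end
  nil
end

doc = REXML::Document.new(SAMPLE_HTML)

# The outer div has only two child divs, so div[5] is where this path breaks.
puts first_failing_step(doc, '/html/body/div/div[5]/div').inspect

# This path matches all the way down, so the helper returns nil.
puts first_failing_step(doc, '/html/body/div/div[2]').inspect
```

The same prefix-by-prefix check should carry over to Hpricot by testing whether each prefix search returns an empty result, though I have not run that variant.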

