On Dec 5, 2011, at 1:05 PM, JavierQQ wrote:

> Hi,
> 
> I want to grab some information about university names, and I found
> this term called "web scraping".
> I searched for it on Google, and there are tools for it in Ruby.
> One of them is Nokogiri, but I'm a bit confused because it seems that
> it only gets information that is already in an HTML or XML document.

Yes, Nokogiri is a toolkit for (among lots of other things) running XPath or 
CSS queries against a text document. That document can come from anywhere: a 
string or an IO stream of one sort or another with HTML or XML in it will do.

> 
> I found a webpage that has a list of university names as a
> 
> <select> </select> (HTML tag)
> 
> and I want to grab that information
> 
> The question is... can I do that with nokogiri or another tool?
> The list is like a country list, but with the names of the
> universities of my country.

A select can be traversed like any other DOM object. This should be fairly 
close:

# given doc is a Nokogiri::XML::Document or Nokogiri::HTML::Document
doc.css('#yourPickerId option').each do |opt|
  foo = opt['value']  # the option's value attribute; opt.text is the visible label
  # whatever else you want to do with foo here
end

> 
> It seems that it gets that information from a DB using Ajax, and what
> I'm trying to do may not be legal or possible.

If it's Ajax, you'll need to run a JavaScript interpreter against the page. 
Rails 3.1 shows the way to do that server-side. Once you have flattened the 
page into a text stream that includes the data you want (the base markup plus 
the result of the Ajax call), Nokogiri or Hpricot or any other XML/HTML parser 
can rip through that DOM and give you individual nodes to play with.

> 
> I'd really appreciate it if someone could help me understand what this
> tool is used for, and whether what I'm trying to do is possible.

Possible, sure. It's never entirely clear why someone would run an Ajax request 
to populate a page. They may have done it to keep scrapers (like you) out, or 
they may have done it to isolate and speed up a laggy part of the initial page 
load. If it's the latter, they aren't actively discouraging you (did you ask 
them if you could do this?), and you might want to look into loading the 
endpoint of that Ajax request directly instead of the surrounding page, as 
that would eliminate the JavaScript layer entirely. You'd have one HTTP 
request, and unless that endpoint is restricted to requests from within its 
own domain, you would likely get JSON or some other structured data in return, 
which could be even easier to interpret in your application.
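
A minimal sketch of that direct-to-endpoint approach, using only the standard 
library. Everything specific here is an assumption: I'm guessing the endpoint 
returns a JSON array of objects with a "name" key, and the method names are 
mine. Check the real response in your browser's network inspector and adjust.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Turn a JSON response body into a list of names.
# ASSUMPTION: the body is an array of objects, each with a "name" key.
def university_names(json_body)
  JSON.parse(json_body).map { |row| row['name'] }
end

# Fetch the endpoint directly. The URL is whatever the page's Ajax call
# requests; it is not known here, so the caller has to supply it.
def fetch_university_names(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  university_names(response.body)
end

# Offline demonstration with a made-up payload.
sample = '[{"id": 1, "name": "Example University"}, {"id": 2, "name": "Sample Institute"}]'
names = university_names(sample)
```

Keeping the parsing separate from the fetching means you can test the parsing 
against a saved response without hitting their server every time.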

Walter

> 
> Thanks
> 
> Javier Q
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Ruby on Rails: Talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/rubyonrails-talk?hl=en.
> 
