On Dec 5, 2011, at 1:05 PM, JavierQQ wrote:
> HI,
>
> I want to grab some information about university names, and I found
> this term called "web scraping"
> I search about it in google, and there are tools in ruby.
> One of them is nokogiri but I'm a bit confused because it seems that
> it only gets information that its already in an html or xml
Yes, Nokogiri is a toolkit for (among lots of other things) running XPath or
CSS queries against HTML or XML text. That text can come from anywhere -- a
string, a file, or an IO stream of one sort or another will do.
>
> I found a webpage that have a list of university names as a
>
> <select> </select> (html label)
>
> and I want to grab that information
>
> The question is... can I do that with nokogiri or another tool?
> The list is like a country list, but with the names of the
> universities of my country.
A select can be traversed like any other DOM object. This should be fairly
close:
  # given doc is a parsed Nokogiri::HTML::Document (or Nokogiri::XML::Document)
  doc.css('#yourPickerId option').each do |opt|
    foo = opt['value'] # the value attribute; opt.text gives the visible label
    # whatever else you want to do with foo here
  end
>
> It seems that it get that information from an DB using ajax, and what
> I'm trying to do may not be legal or possible
If it's Ajax, you'll need to run a JavaScript interpreter against it. Rails 3.1
shows one way to run JavaScript server-side. Once you have munged the page into
a text stream that includes the desired data (flattened down to the base markup
plus the result of the Ajax call), Nokogiri, Hpricot, or any other XML/HTML
parser could rip through that DOM and give you individual nodes to play with.
>
> I'll really appreciate if someone can help me to understand what this
> tool is used for, and if what I'm trying to do is possible
Possible, sure. It's never entirely clear why someone would use an Ajax request
to populate a page. They may have done it to keep scrapers (like you) out,
or they may have done it to isolate and speed up a laggy part of the initial
page load. If the latter (so they aren't actually discouraging you -- did you
ask them if you could do this?), then you might also want to look into loading
the endpoint of that Ajax request instead of the surrounding page, as that
would take JavaScript out of the picture entirely. You'd have one HTTP
request, and unless that endpoint is set up to only accept requests from
within its own domain, you would likely get JSON or some other structured data
in return, and that could be even easier to interpret in your application.
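
A rough sketch of that direct-to-endpoint approach, using only Ruby's standard library. The URL is a placeholder, and the payload shape (an array of objects with a "name" key) is purely a guess on my part; look at the actual request in your browser's network inspector and adjust both.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint; replace it with the URL the page really calls.
ENDPOINT = 'http://example.com/universities.json'

# Fetch the raw response body from the endpoint with a single GET request.
def fetch_body(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Pull the names out of the JSON text. The assumed shape is an array of
# objects that each carry a "name" key; the real payload may differ.
def university_names(json_text)
  JSON.parse(json_text).map { |entry| entry['name'] }
end

# Usage, once ENDPOINT points at something real:
#   puts university_names(fetch_body(ENDPOINT))
```

Keeping the fetch and the parsing in separate methods means you can try the parsing against a saved copy of the response before you hit their server at all.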
Walter
>
> Thanks
>
> Javier Q
>
> --
> You received this message because you are subscribed to the Google Groups
> "Ruby on Rails: Talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/rubyonrails-talk?hl=en.
>