At which point are you getting stuck?

Simply GET the index page, parse it with Nokogiri, select the <a> tags you
are interested in, extract the URLs from their href attributes, and do a
recursive GET on those URLs.
Each page type should have its own function that performs the GET and the
parsing.

If you have to fetch a large number of pages, store your crawl state in a
database. For example, keep a separate table of URLs to be parsed (with url
as a unique key) and mark each row as "to be parsed" or "already parsed".
Of course, you need to normalize all URLs to avoid duplicates in the table.
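
To make the bookkeeping concrete, here is a minimal sketch that uses an
in-memory Hash as a stand-in for that table. The names normalize_url and
UrlQueue are invented for this example, and the normalization rules shown
(lowercase scheme and host, drop the fragment) are only one reasonable
choice. In a real crawl the Hash would be a database table with a unique
index on the url column.

```ruby
require "uri"

# Map equivalent spellings of a URL to a single key.
def normalize_url(url)
  uri = URI.parse(url.strip).normalize # lowercases scheme/host, "" path -> "/"
  uri.fragment = nil                   # "#section" does not change the page
  uri.to_s
end

# In-memory stand-in for a table with a unique url column and a status.
class UrlQueue
  def initialize
    @rows = {} # normalized url => :pending or :parsed
  end

  # Add a URL unless it is already known; returns true if it was new.
  def enqueue(url)
    key = normalize_url(url)
    return false if @rows.key?(key)

    @rows[key] = :pending
    true
  end

  # Oldest URL still waiting to be parsed, or nil when none are left.
  def next_pending
    @rows.key(:pending)
  end

  def mark_parsed(url)
    @rows[normalize_url(url)] = :parsed
  end

  def pending_count
    @rows.count { |_url, status| status == :pending }
  end
end
```

The crawl loop then becomes: take next_pending, fetch and parse it, enqueue
every link found, and call mark_parsed. If the process dies, it can resume
from whatever rows are still pending.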

Besides, you could have asked in ror2ru.

On Tue, May 12, 2015 at 7:42 AM, Роман Ярыгин <[email protected]> wrote:

> Hello!
>
> I need to grab all of a site's data together with its full tree structure.
> Every page has links to child pages. How do I build the site tree with
> Nokogiri? It has to visit pages recursively and scrape all directory links,
> but I can't work out the full algorithm. How do I do that?
> P.S. I don't need to "Save all site on disk with HTTRack". The data will
> be processed and copied to the new, redesigned version of the original site.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Ruby on Rails: Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/rubyonrails-talk/db39c272-d353-42be-ae09-4a09fcf4abca%40googlegroups.com
> <https://groups.google.com/d/msgid/rubyonrails-talk/db39c272-d353-42be-ae09-4a09fcf4abca%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

