Re: [Rails] How to scrape a page without knowing its html structure

Hassan Schroeder Sat, 12 Dec 2009 09:20:50 -0800

On Sat, Dec 12, 2009 at 2:56 AM, kalyan <[email protected]> wrote:


> I'm building a module for my site that needs to import a user's blog
> into the site. I can read the blog through its RSS feed, but the feed
> doesn't give me the entire content, so I need to scrape the user's
> blog page. How can I scrape a page without knowing its HTML
> structure?

Unless you want the entire page, you need to know something about
the page structure.

Well. If the page is even reasonably marked up (DIVs/Ps-wise) and
you create an array of block elements, you *might* get away with the
assumption that the ones with significant amounts of text (for some
value of "significant") are the actual blog post.

Might. I'd imagine a lot more going into that heuristic, since you're
looking for an AI solution  :-)
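
For what it's worth, here is a minimal sketch of that "blocks with
significant text" idea, assuming plain Ruby and a crude regex split
rather than a real HTML parser (in practice something like Nokogiri
would be the saner choice). The method name, the tag list, and the
200-character threshold are all made up for illustration, and it will
be fooled by long sidebars, comment threads, and nested markup:

```ruby
# Hypothetical helper: the tag list and threshold are illustrative guesses,
# not a tested recipe for any particular blog platform.
BLOCK_TAGS = /<\/?(?:div|p|article|section|td|li|blockquote)\b[^>]*>/i

def candidate_post_blocks(html, min_chars: 200)
  # Drop script and style bodies first; they are never post text.
  cleaned = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1>/im, " ")

  # Split on block-level tags so each chunk approximates one block's own text.
  chunks = cleaned.split(BLOCK_TAGS)

  # Strip the remaining inline tags and collapse whitespace in each chunk.
  texts = chunks.map { |chunk| chunk.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip }

  # Keep only chunks with a "significant" amount of text.
  texts.select { |text| text.length >= min_chars }
end
```

Calling candidate_post_blocks(page_source) returns an array of text
chunks, longest-or-not, in document order; what counts as
"significant" is the part you would have to tune per site.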

Good luck,
-- 
Hassan Schroeder ------------------------ [email protected]
twitter: @hassan

