Re: A simple example of HTML scraping?

Jeffrey Kegler Sun, 02 Mar 2014 16:16:44 -0800

Here's an example that scrapes tables <https://gist.github.com/jeffreykegler/9316134>. It's adapted from the synopsis, which is also part of the test suite. As the second test shows, it's very clever about finding tables, even when they are highly defective.


-- jeffrey


On 03/02/2014 01:55 PM, [email protected] wrote:

Thank you Jeffrey, you answered my question quite precisely and pointed me to the right resources :)
However, the examples in Marpa::R2::HTML's doc show how to "do surgery" on the DOM, but not how to extract the data afterwards. You manipulate the DOM to remove some set of elements. But I still need to retrieve information from this "pruned" DOM afterwards. Well, once you've extracted all the relevant elements, scraping them is dead easy. I was wondering whether Marpa::R2::HTML had a simple syntax for doing this extra step, or whether I would have to use a very simple scraper afterwards.
One example I would like to see is how to retrieve all links that are inside any tables, and put them into a list.
I realize that this simple example would be better implemented with Web::Scraper or Mojo::UserAgent in real life; but it would be a good example of how to do scraping with Marpa. A very simple example to get users started: copy-paste it, modify it, and add complexity as you go along.
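
To make the "links inside tables" task concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is not Marpa::R2::HTML and does not reflect its API; the class name TableLinkCollector and the function links_in_tables are made up for illustration. It also assumes reasonably well-formed markup: unlike the table detection described in the reply above, it does nothing to repair an unclosed <table>.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that occur inside any <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # how many <table> elements are currently open
        self.links = []       # hrefs found while table_depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Ignore stray </table> tags so the depth never goes negative.
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def links_in_tables(html_text):
    """Return the list of hrefs for links nested in tables (hypothetical helper)."""
    collector = TableLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

Calling links_in_tables(page_source) returns the hrefs in document order. Links outside every table are skipped, and links in nested tables are included.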
I have used Web::Scraper and Mojo::UserAgent, which both do a very good job even for fairly complex scenarios. I have a complex project in mind that might benefit from Marpa super-power. Let me submit it here: comments and ideas are welcome :-) The idea is at a very early stage, it's all in my mind, so I might be overcomplicating things at the moment: I might come up with simpler heuristics when I start coding. I won't provide concrete examples as I will keep the use-case idea to myself for now. Here you go:
I retrieve one piece of information (call it X) from one source, and I know that this piece of information is present on 20 other websites which all have radically different layouts. In all of these pages there is also some other information that I want to scrape (call it Y).
X is a hash with various info, some arrays in it, etc. I have some regexes for Y, and I know that it will be contained in some tables (or <div>s) along with the X info. So, I can use the information X to identify the interesting tables (or <div>s) in each page: they are the ones where X is displayed. And then I can develop some simple but flexible heuristic to find the related information Y.
That's the idea anyway...

So basically:
1. Find some elements of X in the page, and use these to decide where the relevant information must be
2. Develop a flexible heuristic to extract Y from these relevant areas (it might not be the same heuristic in all websites, but I might be able to identify only 2 or 3 heuristics, one of which will work in each page)
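
The two steps above can be sketched as follows. This is an illustration in Python with the standard library only, not Marpa::R2::HTML. The names TableTextCollector, extract_y, x_values, y_pattern and min_hits are all hypothetical. The sketch assumes X can be reduced to a list of plain strings and Y to a single regex, and it looks only at tables, not <div>s.

```python
import re
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collect the visible text of each outermost <table> as one string."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of open <table> elements
        self.current = []   # text fragments of the table being read
        self.tables = []    # one joined string per outermost table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.tables.append(" ".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.current.append(data.strip())


def extract_y(html_text, x_values, y_pattern, min_hits=1):
    """Step 1: keep tables mentioning at least min_hits of the X strings.
    Step 2: apply the Y regex to the text of those tables only."""
    parser = TableTextCollector()
    parser.feed(html_text)
    parser.close()
    results = []
    for text in parser.tables:
        hits = sum(1 for x in x_values if x in text)
        if hits >= min_hits:
            results.extend(re.findall(y_pattern, text))
    return results
```

A site-specific variant would swap in a different y_pattern or a different way of choosing the relevant region, which is where the "2 or 3 heuristics" would plug in.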
The 20 websites have very different DOM structures (some using tables, some using divs, no CSS in common whatsoever, and some displaying information horizontally or vertically, and in different orders), and some will change structure once in a while. So I want to code an algorithm that is as general, simple, and maintainable as possible. That's why I think that Marpa::R2::HTML might be the right candidate. It's not advertised as a scraper, but the fact that it retrieves the DOM tree (just like Mojo::UserAgent does) and manipulates it makes me feel like it could be used as one. Mojo::UserAgent gives methods to race through the tree and find information. I feel like Marpa's capacity to manipulate the tree could be a more "high-level" way of extracting information from it. It feels very exciting but...
I would appreciate help with:
- the simplest scraping example to get me started
- getting feedback from someone who knows the tool: do you feel like my project could benefit from Marpa? Is my understanding of Marpa::R2::HTML's capabilities correct?
Thank you for any feedback
Pierre
--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.


