Here's an example that scrapes tables: <https://gist.github.com/jeffreykegler/9316134>. It's adapted from the synopsis, which is also part of the test suite. As the second test shows, it's very clever about finding tables, even when they are highly defective.

-- jeffrey

On 03/02/2014 01:55 PM, [email protected] wrote:
Thank you Jeffrey, you answer my question quite precisely and point to the right resources :)

However, the examples in Marpa::R2::HTML's doc show how to "do surgery" to the DOM, but not how to extract the data afterwards. You manipulate the DOM to remove some set of elements. But I still need to retrieve information from this "pruned" DOM afterwards. Well, once you've extracted all the relevant elements, scraping them is dead easy. I was wondering whether Marpa::R2::HTML had a simple syntax for doing this extra step, or whether I would have to use a very simple scraper afterwards.

One example I would like to see for example, is how to retrieve all links that are inside any tables, and put them into a list.
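
To make the task concrete, here is a minimal sketch of "collect every link that sits inside a table". It is an illustration only: it uses Python's standard html.parser, not Marpa::R2::HTML, and the names TableLinkCollector, links_in_tables and SAMPLE are made up for this example. Unlike Marpa::R2::HTML, this parser does not repair defective HTML, so the sketch assumes every <table> is properly closed.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags nested inside any <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # greater than 0 while inside at least one <table>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def links_in_tables(html):
    """Return the hrefs of all links found inside tables, in document order."""
    parser = TableLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page: one link outside any table, two inside (one nested).
SAMPLE = """
<p><a href="/outside">outside</a></p>
<table><tr><td><a href="/one">one</a></td></tr>
<tr><td><table><tr><td><a href="/two">two</a></td></tr></table></td></tr></table>
"""

print(links_in_tables(SAMPLE))
```

The depth counter is what lets nested tables work: a link counts as "inside a table" as long as at least one enclosing <table> is still open.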

I realize that this simple example would be better implemented with Web::Scraper or Mojo::UserAgent in real life, but it would be a good example of how to do scraping with Marpa. A very simple example to get users started: copy-paste it, modify it, and add complexity as you go along.


I have used Web::Scraper and Mojo::UserAgent, which both do a very good job even for fairly complex scenarios. I have a complex project in mind that might benefit from Marpa-super-power. Let me submit it here; comments and ideas are welcome :-) The idea is at a very early stage and all in my mind, so I might be overcomplicating things at the moment: I might come up with simpler heuristics when I start coding. I won't provide concrete examples, as I will keep the use-case idea to myself for now. Here you go:

I retrieve one piece of information (call it X) from one source, and I know that this piece of information is present on 20 other websites which all have radically different layouts. In all of these pages there is also some other information that I want to scrape (call it Y).

X is a hash with various info, some arrays in it, etc. I have some regexes for Y, and I know that it will be contained in some tables (or <div>s) along with the X info. So, I can use the information X to identify the interesting tables (or <div>s) in each page: they are the ones where X is displayed. And then I can develop some simple but flexible heuristic to find the related information Y.
That's the idea anyway...

So basically:
1. Find some elements of X in the page, and use these to decide where the relevant information must be
2. Develop a flexible heuristic to extract Y from these relevant areas (it might not be the same heuristic in all websites, but I might be able to identify only 2 or 3 heuristics, one of which will work in each page)
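
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard html.parser rather than Marpa::R2::HTML, it assumes <table> and <div> tags are properly nested and closed, and it assigns each piece of text to the innermost open block only. The names BlockTextCollector and extract_y, and the sample values ("Widget A", prices in EUR), are hypothetical.

```python
import re
from html.parser import HTMLParser


class BlockTextCollector(HTMLParser):
    """Collect the text of each <table> and <div> (innermost block only)."""

    CONTAINERS = {"table", "div"}

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # one list of text chunks per open container
        self.blocks = []       # normalized text of each closed container

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.open_blocks.append([])

    def handle_data(self, data):
        # Text belongs to the innermost open container, if there is one.
        if self.open_blocks:
            self.open_blocks[-1].append(data)

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.open_blocks:
            chunks = self.open_blocks.pop()
            # Join chunks and collapse runs of whitespace to single spaces.
            self.blocks.append(" ".join(" ".join(chunks).split()))


def extract_y(html, x_terms, y_pattern):
    """Return Y matches found in blocks that mention at least one X term."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    results = []
    for text in parser.blocks:
        # Step 1: keep only blocks where some known X value is displayed.
        if any(term in text for term in x_terms):
            # Step 2: apply a Y heuristic (here, a single regex) to that block.
            results.extend(y_pattern.findall(text))
    return results


# Hypothetical page: only the table mentions the known X value "Widget A".
PAGE = """
<div>Unrelated banner 10 EUR</div>
<table><tr><td>Widget A</td><td>Price: 42 EUR</td></tr></table>
<div>Footer</div>
"""
Y_PATTERN = re.compile(r"(\d+) EUR")

print(extract_y(PAGE, ["Widget A"], Y_PATTERN))
```

Swapping in a different heuristic per site would mean replacing only step 2 (the regex) while step 1 stays the same.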

The 20 websites have very different DOM structures (some using tables, some using divs, no CSS in common whatsoever, and some displaying information horizontally or vertically, and in different orders), and some will change structure once in a while. So I want to code an algorithm as general, simple, and maintainable as possible. That's why I think that Marpa::R2::HTML might be the right candidate. It's not advertised as a scraper, but the fact that it retrieves the DOM tree (just like Mojo::UserAgent does) and manipulates it makes me feel like it could be used as one. Mojo::UserAgent gives methods to race through the tree and find information. I feel like Marpa's capacity to manipulate the tree could be a more "high-level" way of extracting information from it. It feels very exciting but...

I would appreciate help with:
- the simplest scraping example to get me started
- getting feedback from someone who knows the tool: do you feel like my project could benefit from Marpa? Is my understanding of Marpa::R2::HTML's capabilities correct?

Thank you for any feedback
Pierre
--
You received this message because you are subscribed to the Google Groups "marpa parser" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
