Thank you, Jeffrey, you answered my question quite precisely and pointed to
the right resources :)
However, the examples in Marpa::R2::HTML's docs show how to "do surgery" on
the DOM, but not how to extract data afterwards. You manipulate the DOM to
remove some set of elements, but I still need to retrieve information from
this "pruned" DOM afterwards. Well, once you've extracted all the relevant
elements, scraping them is dead easy. I was wondering whether
Marpa::R2::HTML has a simple syntax for doing this extra step, or whether I
would have to use a very simple scraper afterwards.
One example I would like to see is how to retrieve all the links that are
inside any table, and put them into a list.
I realize that in real life this simple example would be better implemented
with Web::Scraper or Mojo::UserAgent; but it would be a good example of how
to do scraping with Marpa. A very simple example to get users started:
copy-paste it, modify it, and add complexity as you go along.
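Here is roughly what I have in mind, as a minimal, untested sketch. I am
assuming, from my reading of the docs, that Marpa::R2::HTML::original()
returns the current element's raw source text; to stay on the safe side I
collect the table fragments in a first pass and re-parse each fragment in a
second pass, rather than calling html() from inside a handler:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2::HTML qw(html);

    my $page_html =
          '<html><body><p><a href="/not-in-a-table">outside</a></p>'
        . '<table><tr><td><a href="/inside-1">one</a></td>'
        . '<td><a href="/inside-2">two</a></td></tr></table>'
        . '</body></html>';

    # Pass 1: grab the raw source text of every <table>.
    my @table_sources;
    html(
        \$page_html,
        {   table => sub {
                push @table_sources, Marpa::R2::HTML::original();
                return;
            },
        }
    );

    # Pass 2: re-parse each table fragment and collect the hrefs.
    my @links;
    for my $fragment (@table_sources) {
        html(
            \$fragment,
            {   a => sub {
                    my $href = Marpa::R2::HTML::attributes()->{href};
                    push @links, $href if defined $href;
                    return;
                },
            }
        );
    }

    print "$_\n" for @links;    # /inside-1 and /inside-2, not /outside

(Nested tables would be reported twice with this two-pass trick; good
enough for a first example, I think.)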
I have used Web::Scraper and Mojo::UserAgent, which both do a very good job
even for fairly complex scenarios. I have a complex project in mind that
might benefit from Marpa-super-power. Let me submit it here: comments and
ideas are welcome :-) The idea is at a very early stage, it's all in my
mind, so I might be over-complicating things at the moment: I might come up
with simpler heuristics when I start coding. I won't provide concrete
examples, as I will keep the use-case idea to myself for now. Here you go:
I retrieve one piece of information (call it X) from one source, and I know
that this piece of information is present on 20 other websites, which all
have radically different layouts. Each of these pages also contains some
other information that I want to scrape (call it Y).
X is a hash with various info, some arrays in it, etc. I have some regexes
for Y, and I know that it will be contained in some tables (or <div>s)
along with the X info. So I can use the information X to identify the
interesting tables (or <div>s) in each page: they are the ones where X is
displayed. And then I can develop a simple but flexible heuristic to find
the related information Y.
That's the idea anyway...
So basically:
1. Find some elements of X in the page, and use these to decide where the
relevant information must be.
2. Develop a flexible heuristic to extract Y from these relevant areas (it
might not be the same heuristic on all websites, but I might be able to
identify only 2 or 3 heuristics, one of which will work on each page); a
sketch of both steps follows below.
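Here is how I picture it with Marpa::R2::HTML, again as an untested sketch.
The anchors for X and the regex for Y are hypothetical placeholders of my
own (the real ones depend on the use case), and I am again assuming that
original() returns the element's raw source text:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2::HTML qw(html);

    my $page_html =
          '<div><p>Acme Widget, price 42.00, listed 2014-03-01</p></div>'
        . '<div><p>Unrelated block, 2010-01-01</p></div>';

    # Hypothetical anchors: fragments of X I already know from the
    # first source, and a regex for Y (here, a date displayed near X).
    my @x_anchors = ( qr/Acme\s+Widget/i, qr/42\.00/ );
    my $y_pattern = qr/\b\d{4}-\d{2}-\d{2}\b/;

    my @y_found;
    my $find_y_near_x = sub {
        my $source = Marpa::R2::HTML::original();

        # Step 1: this table/div is "interesting" only if every X
        # anchor is displayed inside it.
        return unless @x_anchors == grep { $source =~ $_ } @x_anchors;

        # Step 2: apply the Y heuristic inside the interesting element.
        push @y_found, $source =~ /($y_pattern)/g;
        return;
    };

    # The same handler serves tables and divs, since the 20 sites
    # differ in which one they use.
    html( \$page_html, { table => $find_y_near_x, div => $find_y_near_x } );

    print "$_\n" for @y_found;    # 2014-03-01 only

(As with the tables above, a nested <div> inside an interesting <div>
would match twice; I'd deduplicate in real code.)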
The 20 websites have very different DOM structures (some using tables, some
using divs, no CSS in common whatsoever, some displaying information
horizontally or vertically, and in different orders), and some will change
structure once in a while. So I want to code an algorithm that is as
general, simple, and maintainable as possible. That's why I think that
Marpa::R2::HTML might be the right candidate. It's not advertised as a
scraper, but the fact that it builds the DOM tree (just like
Mojo::UserAgent does) and manipulates it makes me feel like it could be
used as one. Mojo::UserAgent provides methods to traverse the tree and
find information. I feel like Marpa's capacity to manipulate the tree could
be a more "high-level" way of extracting information from it. It feels very
exciting but...
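(For comparison, my links-in-tables example above collapses to almost one
line in Mojo::DOM with CSS selectors; untested, recent Mojolicious, and the
URL is a placeholder:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Mojo::UserAgent;

    # CSS selectors do the whole job: every <a href> inside a <table>.
    my @links = Mojo::UserAgent->new->get('http://example.com/')->res->dom
        ->find('table a[href]')->map( attr => 'href' )->each;

    print "$_\n" for @links;

This is the kind of conciseness I'd like to reach, or beat, with Marpa's
tree-manipulation approach.)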
I would appreciate help with:
- the simplest scraping example to get me started
- getting feedback from someone who knows the tool: do you feel like my
project could benefit from Marpa? Is my understanding of Marpa::R2::HTML's
capabilities correct?
Thank you for any feedback.
Pierre