Thank you, Jeffrey, you answered my question quite precisely and pointed to
the right resources :)
However, the examples in Marpa::R2::HTML's docs show how to "do surgery" on
the DOM, but not how to extract data afterwards. You manipulate the DOM to
remove some set of elements, but I still need to retrieve information from
this "pruned" DOM afterwards. Well, once you've extracted all the relevant
elements, scraping them is dead easy. I was wondering whether
Marpa::R2::HTML has a simple syntax for doing this extra step, or whether I
would have to use a very simple scraper afterwards.
One example I would like to see is how to retrieve all the links that are
inside any table, and put them into a list.
I realize that in real life this simple example would be better implemented
with Web::Scraper or Mojo::UserAgent; but it would be a good example of how
to do scraping with Marpa. A very simple example to get users started:
copy-paste it, modify it, and add complexity as you go along.
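Here is roughly what I have in mind, as a minimal, untested sketch. I am
assuming, from my reading of the docs, that Marpa::R2::HTML::original()
returns the current element's raw source text; to stay on the safe side I
collect the table fragments in a first pass and re-parse each fragment in a
second pass, rather than calling html() from inside a handler:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2::HTML qw(html);

    my $page_html =
          '<html><body><p><a href="/not-in-a-table">outside</a></p>'
        . '<table><tr><td><a href="/inside-1">one</a></td>'
        . '<td><a href="/inside-2">two</a></td></tr></table>'
        . '</body></html>';

    # Pass 1: grab the raw source text of every <table>.
    my @table_sources;
    html(
        \$page_html,
        {   table => sub {
                push @table_sources, Marpa::R2::HTML::original();
                return;
            },
        }
    );

    # Pass 2: re-parse each table fragment and collect the hrefs.
    my @links;
    for my $fragment (@table_sources) {
        html(
            \$fragment,
            {   a => sub {
                    my $href = Marpa::R2::HTML::attributes()->{href};
                    push @links, $href if defined $href;
                    return;
                },
            }
        );
    }

    print "$_\n" for @links;    # /inside-1 and /inside-2, not /outside

(Nested tables would be reported twice with this two-pass trick; good
enough for a first example, I think.)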
I have used Web::Scraper and Mojo::UserAgent, which both do a very good job
even for fairly complex scenarios. I have a complex project in mind that
might benefit from Marpa-super-power. Let me submit it here: comments and
ideas are welcome :-) The idea is at a very early stage, it's all in my
mind, so I might be over-complicating things at the moment: I might come up
with simpler heuristics when I start coding. I won't provide concrete
examples, as I will keep the use-case idea to myself for now. Here you go:
I retrieve one piece of information (call it X) from one source, and I know
that this piece of information is present on 20 other websites, which all
have radically different layouts. Each of these pages also contains some
other information that I want to scrape (call it Y).
X is a hash with various info, some arrays in it, etc. I have some regexes
for Y, and I know that it will be contained in some tables (or <div>s)
along with the X info. So I can use the information X to identify the
interesting tables (or <div>s) in each page: they are the ones where X is
displayed. And then I can develop a simple but flexible heuristic to find
the related information Y.
That's the idea anyway...
So basically:
1. Find some elements of X in the page, and use these to decide where the
relevant information must be.
2. Develop a flexible heuristic to extract Y from these relevant areas (it
might not be the same heuristic on all websites, but I might be able to
identify only 2 or 3 heuristics, one of which will work on each page); a
sketch of both steps follows below.
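Here is how I picture it with Marpa::R2::HTML, again as an untested sketch.
The anchors for X and the regex for Y are hypothetical placeholders of my
own (the real ones depend on the use case), and I am again assuming that
original() returns the element's raw source text:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2::HTML qw(html);

    my $page_html =
          '<div><p>Acme Widget, price 42.00, listed 2014-03-01</p></div>'
        . '<div><p>Unrelated block, 2010-01-01</p></div>';

    # Hypothetical anchors: fragments of X I already know from the
    # first source, and a regex for Y (here, a date displayed near X).
    my @x_anchors = ( qr/Acme\s+Widget/i, qr/42\.00/ );
    my $y_pattern = qr/\b\d{4}-\d{2}-\d{2}\b/;

    my @y_found;
    my $find_y_near_x = sub {
        my $source = Marpa::R2::HTML::original();

        # Step 1: this table/div is "interesting" only if every X
        # anchor is displayed inside it.
        return unless @x_anchors == grep { $source =~ $_ } @x_anchors;

        # Step 2: apply the Y heuristic inside the interesting element.
        push @y_found, $source =~ /($y_pattern)/g;
        return;
    };

    # The same handler serves tables and divs, since the 20 sites
    # differ in which one they use.
    html( \$page_html, { table => $find_y_near_x, div => $find_y_near_x } );

    print "$_\n" for @y_found;    # 2014-03-01 only

(As with the tables above, a nested <div> inside an interesting <div>
would match twice; I'd deduplicate in real code.)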
The 20 websites have very different DOM structures (some using tables, some
using divs, no CSS in common whatsoever, some displaying information
horizontally or vertically, and in different orders), and some will change
structure once in a while. So I want to code an algorithm that is as
general, simple, and maintainable as possible. That's why I think that
Marpa::R2::HTML might be the right candidate. It's not advertised as a
scraper, but the fact that it builds the DOM tree (just like
Mojo::UserAgent does) and manipulates it makes me feel like it could be
used as one. Mojo::UserAgent provides methods to traverse the tree and
find information. I feel like Marpa's capacity to manipulate the tree could
be a more "high-level" way of extracting information from it. It feels very
exciting but...
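(For comparison, my links-in-tables example above collapses to almost one
line in Mojo::DOM with CSS selectors; untested, recent Mojolicious, and the
URL is a placeholder:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Mojo::UserAgent;

    # CSS selectors do the whole job: every <a href> inside a <table>.
    my @links = Mojo::UserAgent->new->get('http://example.com/')->res->dom
        ->find('table a[href]')->map( attr => 'href' )->each;

    print "$_\n" for @links;

This is the kind of conciseness I'd like to reach, or beat, with Marpa's
tree-manipulation approach.)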
I would appreciate help with:
- the simplest scraping example to get me started
- getting feedback from someone who knows the tool: do you feel like my
project could benefit from Marpa? Is my understanding of Marpa::R2::HTML's
capabilities correct?
Thank you for any feedback.
Pierre