Here's an example that scrapes tables:
<https://gist.github.com/jeffreykegler/9316134>. It's adapted from the
synopsis, which is also part of the test suite. As the second test
shows, it's very clever about finding tables, even when they are highly
defective.
-- jeffrey
On 03/02/2014 01:55 PM, [email protected] wrote:
Thank you Jeffrey, you answer my question quite precisely and point to
the right resources :)
However, the examples in Marpa::R2::HTML's doc show how to "do
surgery" to the DOM, but not how to extract the data afterwards. You
manipulate the DOM to remove some set of elements. But I still need to
retrieve information from this "pruned" DOM afterwards. Well, once
you've extracted all the relevant elements, scraping them is dead
easy. I was wondering whether Marpa::R2::HTML had a simple syntax for
doing this extra step, or whether I would have to use a very simple
scraper afterwards.
One example I would like to see is how to retrieve all links that are
inside any tables, and put them into a list.
I realize that this simple example would be better implemented with
Web::Scraper or Mojo::UserAgent in real life; but it would be a good
example of how to do scraping with Marpa. A very simple example to get
users started: copy-paste it, modify it, and add complexity as you go
along.
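
To make the task concrete, here is a minimal sketch of "collect every link that sits inside a table". It does not use Marpa::R2::HTML: it is a stand-in written with Python's standard html.parser, and the names TableLinkCollector and links_in_tables, as well as the sample markup, are made up for illustration. Unlike Marpa::R2::HTML, this parser does not repair defective HTML, so the sketch assumes every <table> is properly closed.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that occur inside any <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # greater than 0 while inside at least one <table>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def links_in_tables(html_text):
    """Return the hrefs of all links nested in tables, in document order."""
    parser = TableLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample page: one link outside a table, two inside.
sample = (
    '<a href="/outside">skip me</a>'
    '<table><tr><td><a href="/one">1</a></td>'
    '<td><a href="/two">2</a></td></tr></table>'
)
print(links_in_tables(sample))  # ['/one', '/two']
```

The depth counter is what restricts the result to links inside tables; nested tables simply raise the counter further.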
I have used Web::Scraper and Mojo::UserAgent, which both do a very good
job even for fairly complex scenarios. I have a complex project in
mind that might benefit from Marpa-super-power. Let me submit it here:
Comments and ideas are welcome :-) The idea is at a very early stage,
it's all in my mind, so I might be overcomplicating things at the
moment: I might come up with simpler heuristics when I start coding. I
won't provide concrete examples as I will keep the use-case idea to
myself for now. Here you go:
I retrieve one piece of information (call it X) from one source, and I
know that this piece of information is present on 20 other websites
which all have radically different layouts. In all of these pages
there is also some other information that I want to scrape (call it Y).
X is a hash with various info, some arrays in it, etc. I have some
regexes for Y, and I know that it will be contained in some tables (or
<div>s) along with the X info. So, I can use the information X to
identify the interesting tables (or <div>s) in each page: they are the
ones where X is displayed. And then I can develop some simple but
flexible heuristic to find the related information Y.
That's the idea anyway...
So basically:
1. Find some elements of X in the page, and use these to decide where
the relevant information must be
2. Develop a flexible heuristic to extract Y from these relevant areas
(it might not be the same heuristic in all websites, but I might be
able to identify only 2 or 3 heuristics, one of which will work in
each page)
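
The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, again in Python's standard library rather than Marpa::R2::HTML: X is reduced to a list of marker strings, Y to a single regex, and the names ContainerText and extract_y, the sample page, and the price pattern are all hypothetical. It also assumes <table> and <div> elements are properly nested and closed.

```python
import re
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Record the text of every <table> and <div> as each one closes."""

    CONTAINERS = {"table", "div"}

    def __init__(self):
        super().__init__()
        self.open_buffers = []  # one text buffer per currently open container
        self.closed = []        # text of each container, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.open_buffers.append([])

    def handle_data(self, data):
        # Text belongs to every container that is currently open.
        for buf in self.open_buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.open_buffers:
            self.closed.append(" ".join(self.open_buffers.pop()))


def extract_y(html_text, x_markers, y_pattern):
    parser = ContainerText()
    parser.feed(html_text)
    parser.close()
    # Step 1: keep the containers whose text mentions every known piece of X.
    relevant = [t for t in parser.closed if all(m in t for m in x_markers)]
    if not relevant:
        return []
    # The container with the least text is taken as the tightest relevant area.
    area = min(relevant, key=len)
    # Step 2: apply the heuristic for Y (here one regex) to that area only.
    return re.findall(y_pattern, area)


# Hypothetical page: the banner price must be ignored, the table price kept.
page = (
    "<div>Price: 1.00 EUR (banner)</div>"
    "<div><table><tr><td>Widget 42</td>"
    "<td>Price: 9.99 EUR</td></tr></table></div>"
)
print(extract_y(page, ["Widget 42"], r"Price: (\d+\.\d{2}) EUR"))  # ['9.99']
```

Supporting two or three heuristics per site would amount to trying several y_pattern values (or functions) against the same area and keeping the first that matches.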
The 20 websites have very different DOM structures (some using tables,
some using divs, no CSS in common whatsoever, and some displaying
information horizontally or vertically, and in different orders), and
some will change structure once in a while. So I want to code an
algorithm as general, simple, and maintainable as possible. That's why
I think that Marpa::R2::HTML might be the right candidate. It's not
advertised as a scraper, but the fact that it retrieves the DOM tree
(just like Mojo::UserAgent does) and manipulates it makes me feel like
it could be used as one. Mojo::UserAgent gives methods to race through
the tree and find information. I feel like Marpa's capacity to
manipulate the tree could be a more "high-level" way of extracting
information from it. It feels very exciting but...
I would appreciate help with:
- the simplest scraping example to get me started
- getting feedback from someone who knows the tool: do you feel
like my project could benefit from Marpa? Is my understanding of
Marpa::R2::HTML's capabilities correct?
Thank you for any feedback
Pierre
--
You received this message because you are subscribed to the Google
Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.