TWIMC -
WWW::Scraper is a module for scraping data from various web-based search
engines.
This module has lived on CPAN for a couple of years as
WWW::Search::Scraper. Like the WWW::Search version, WWW::Scraper does the
following:
1. Sends a query to the target search engine.
2. Scans the resultant list pages, extracting data from the HTML and
delivering it as discrete fields in multiple response objects.
3. "Backends" customized to each search engine (e.g., Google,
NorthernLight) are written in Perl, using whatever modules and methods the
backend's author chooses to use to parse the result list HTML.
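For orientation, here is a minimal usage sketch in the WWW::Search calling
style that Scraper inherits; the backend name ('NorthernLight') and the
response-field accessor shown are illustrative assumptions rather than a
definitive interface:

    use WWW::Scraper;

    # Step 1: send the query to the target engine.
    my $scraper = WWW::Scraper->new('NorthernLight');
    $scraper->native_query('perl scraping');

    # Step 2: each hit arrives as a response object of named fields.
    while ( my $response = $scraper->next_result() ) {
        print $response->url(), "\n";
    }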
Beyond the WWW::Search version, WWW::Scraper extends that capability as follows:
4. "Backends" (herein referred to as "search engine interfaces") may be
specified using a number of different methods -
4a. Rules-based parsing (the so-called "Scraper frame"), combining HTML
tag-capture with text-capture and matching (a sketch follows this list).
4b. HTML may be converted to XML via "HTML Tidy" (invoked by Scraper) and
parsed via XPath-ish formulae.
4c. Rules may be extended by adding custom framing rules.
4d. All the above methods (including Perl) may be applied simultaneously in
any single search engine interface.
4e. Sherlock modules are automatically converted to Scraper frames.
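To give a flavor of the rules-based method (4a), here is a sketch of a
Scraper frame. The nested-array shape and the operation names ('COUNT',
'NEXT', 'HIT*', 'A', 'REGEX', 'TidyXML') are assumptions for illustration,
meant only to show tag-capture and text-capture rules working side by side:

    my $scraperFrame =
      [ 'SEARCH',
        [
          [ 'COUNT', 'of\s+([,0-9]+)\s+results' ],    # text-capture: total hit count
          [ 'NEXT',  '<b>Next</b>' ],                 # how to find the next-page link
        # [ 'TidyXML', '//table[2]//tr' ],            # hypothetical XPath-ish rule (4b)
          [ 'HIT*',                                   # repeat for each result item
            [
              [ 'A',     'url', 'title' ],            # tag-capture: anchor href and text
              [ 'REGEX', '>([^<]+)<', 'description' ] # text-capture into a named field
            ]
          ]
        ]
      ];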
5. Parsing is extended into the "detail" page(s) associated with each item
listed on the search engine's result list.
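The wiring of that detail-page pass is not spelled out here; as a rough
sketch of the idea only (not the module's actual mechanism), each hit's URL
could be fetched and mined for extra fields, continuing the usage sketch
above. The 'Salary:' pattern is an assumption:

    use LWP::Simple qw(get);

    while ( my $response = $scraper->next_result() ) {
        my $html = get( $response->url() );           # fetch the item's detail page
        next unless defined $html;
        my ($salary) = $html =~ m{Salary:\s*([^<]+)}i;
        print $response->url(), ' => ', ( $salary || 'n/a' ), "\n";
    }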
6. Canonical Request/Response Model: canonical queries are converted to
native queries, and native responses are converted to canonical responses.
For instance, "location" is specified by different search engines as
"zip=94043", "state=CA&city=Mountain View", or "areacode=650". All of these
are specified canonically as "location=US-CA-Mountain View", and translated
to the appropriate native field by the search engine interface. Native
response fields are similarly translated to the canonical form upon return.
(This obviously implies one-to-many and many-to-one translations, which
are accommodated easily by Scraper's array-based field values.)
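A sketch of that translation for one hypothetical engine; the helper below
is illustrative only, showing a single canonical field fanning out to two
native fields:

    # Canonical 'location' fans out to two native query fields here;
    # another engine might want 'zip' or 'areacode' instead.
    sub translate_location {
        my ($canonical) = @_;                         # e.g. 'US-CA-Mountain View'
        my ( $country, $state, $city ) = split /-/, $canonical, 3;
        return ( state => $state, city => $city );
    }

    my %native = translate_location('US-CA-Mountain View');
    # %native is now ( state => 'CA', city => 'Mountain View' )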
7. Search engine interfaces will be bundled into categories, based on the
Request/Response canon that each uses (e.g., Auction, Finance, Housing,
Job). This will make it easier to maintain search engine interfaces
separately from the maintenance of the core Scraper.