I'm trying out Nutch and learning how it works.  One application area
where I'm looking for a better tool than my current shell/Perl scripts
is news harvesting.  Can Nutch do this?  Or are there other tools I
should be looking at?

For newspapers that don't provide RSS feeds, I currently have a cron
script that uses wget to fetch the front page of the newspaper's
website a few times per day.  My script applies some magic regexps to
extract news headlines and links, essentially the link anchor text,
and then inserts (1) the URL, (2) the headline, and (3) the timestamp
into a MySQL database table where the URL is the primary key.  If the
same URL was already in the database, the insert fails.  Every URL is
only stored once, and the timestamp indicates when it was first seen.
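
To make the "insert fails if the URL exists" step concrete, here is a
minimal Python sketch of it.  It is not my actual script: it uses
SQLite from the Python standard library in place of MySQL so that it
runs on its own, and the table name `headlines` and the function names
are made up for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the headline store. SQLite stands in for MySQL here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        " url TEXT PRIMARY KEY,"       # primary key makes duplicate inserts fail
        " headline TEXT NOT NULL,"
        " first_seen TEXT NOT NULL)"   # ISO 8601 timestamp of the first sighting
    )
    return conn


def record_headline(conn, url, headline, seen_at):
    """Insert a headline; return True if the URL was new, False if already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO headlines (url, headline, first_seen) VALUES (?, ?, ?)",
                (url, headline, seen_at),
            )
        return True
    except sqlite3.IntegrityError:
        # URL already present: keep the original row and its first_seen value.
        return False
```

The return value tells the caller whether the item was new, which is
also what a smarter scheduler would need to count.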

Afterwards I can extract the most recent headlines as an RSS feed, or
I can select the timestamps for all headlines that contain the word
Greenspan, to track how that news topic has varied over time.
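
The two kinds of queries mentioned above could look roughly like this
in Python.  This is only a sketch: it assumes a table named `headlines`
with columns `url`, `headline` and `first_seen` (names invented for the
example), uses SQLite instead of MySQL, and `demo_store` fills it with
made-up rows.

```python
import sqlite3


def latest_headlines(conn, limit=20):
    """Most recently first-seen items, e.g. as input for generating an RSS feed."""
    return conn.execute(
        "SELECT url, headline, first_seen FROM headlines"
        " ORDER BY first_seen DESC LIMIT ?",
        (limit,),
    ).fetchall()


def first_seen_times(conn, word):
    """Timestamps of all headlines containing `word`, oldest first."""
    rows = conn.execute(
        "SELECT first_seen FROM headlines WHERE headline LIKE ? ORDER BY first_seen",
        (f"%{word}%",),  # SQLite's LIKE ignores case for ASCII letters
    )
    return [row[0] for row in rows]


def demo_store():
    """Build an in-memory table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE headlines (url TEXT PRIMARY KEY, headline TEXT, first_seen TEXT)"
    )
    conn.executemany(
        "INSERT INTO headlines VALUES (?, ?, ?)",
        [
            ("http://example.com/1", "GREENSPAN DOES NOTHING", "2005-07-01T08:00:00"),
            ("http://example.com/2", "Weather stays the same", "2005-07-02T08:00:00"),
            ("http://example.com/3", "Greenspan speaks again", "2005-07-03T08:00:00"),
        ],
    )
    return conn
```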

What I have today is a stupid cron script.  It doesn't notice that
some sources are daily newspapers and others are monthly magazines.
A better solution should count how many new items were found in a
fetch, and adapt the fetch interval to each source.  I figure this
would be an economical feature of any search robot, which would index
fast-changing web pages more often than static web pages.
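
The kind of adaptive interval I have in mind could be sketched like
this in Python.  This is a hypothetical rule of my own, not a
description of what Nutch does: the function name, the target of five
new items per fetch, and the one-hour and thirty-day limits are all
assumptions chosen for illustration.

```python
def next_interval(current_hours, new_items,
                  target_items=5, min_hours=1.0, max_hours=720.0):
    """Suggest the next fetch interval (in hours) for one source.

    If the last fetch found nothing new, back off by doubling the interval.
    Otherwise scale the interval so that a fetch would find roughly
    `target_items` new items if the source keeps publishing at the same rate.
    """
    if new_items == 0:
        proposed = current_hours * 2
    else:
        proposed = current_hours * target_items / new_items
    # Clamp so that a source is never hammered and never forgotten.
    return max(min_hours, min(max_hours, proposed))
```

With these assumed defaults, a daily newspaper that yields many new
items per fetch drifts towards short intervals, while a monthly
magazine drifts towards the upper limit.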

Does Nutch have that kind of smart adaptation of the fetch interval?

Does Nutch save the timestamp when a URL was first seen?

Does Nutch find news headlines that aren't anchor text?  Some
newspapers use the format below, and you don't want to store "read
more" as the headline:

   GREENSPAN DOES NOTHING
   Wash. DC.  Today it was announced...
   <a href="article_4711.html" >read more</a>
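
One way to handle this layout, sketched in Python with the standard
`html.parser` module: when the anchor text is a generic phrase, fall
back to the first non-empty line of text that preceded the link.  The
class name, the function name and the list of generic phrases are my
own assumptions, and real newspaper pages would surely need more
careful rules.

```python
from html.parser import HTMLParser

# Anchor texts considered too generic to serve as a headline (assumed list).
GENERIC_ANCHORS = {"read more", "more", "click here"}


class HeadlineExtractor(HTMLParser):
    """Collect (href, headline) pairs, preferring preceding text over generic anchors."""

    def __init__(self):
        super().__init__()
        self.pending_lines = []  # non-empty text lines seen since the last link
        self.href = None         # href of the link currently open, if any
        self.anchor_parts = []   # text fragments inside the open link
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.anchor_parts = []

    def handle_data(self, data):
        if self.href is not None:
            self.anchor_parts.append(data)
        else:
            self.pending_lines.extend(
                line.strip() for line in data.splitlines() if line.strip()
            )

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            anchor = " ".join("".join(self.anchor_parts).split())
            if anchor.lower() in GENERIC_ANCHORS and self.pending_lines:
                headline = self.pending_lines[0]  # first line before the link
            else:
                headline = anchor
            self.results.append((self.href, headline))
            self.href = None
            self.pending_lines = []


def extract_headlines(html):
    """Return a list of (href, headline) tuples found in `html`."""
    parser = HeadlineExtractor()
    parser.feed(html)
    parser.close()
    return parser.results
```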

What tools are there to extract or select data from the Nutch
database, and is there a good tutorial or documentation on that,
other than the source code?

Also, does Nutch record the response time and availability for each
URL, and is there a way to extract this information from the database
(from the command prompt)?



-- 
  Lars Aronsson ([EMAIL PROTECTED])
  Aronsson Datateknik - http://aronsson.se/


