I'm trying out and learning Nutch. One application area where I'm looking for a better tool than my current shell/Perl scripts is news harvesting. Can Nutch do this? Or are there other tools I should be looking at?
For newspapers that don't provide RSS feeds, I currently have a cron script that uses wget to fetch the front page of the newspaper's website a few times per day. My script applies some magic regexps to extract news headlines and links, essentially the link anchor text, and then inserts (1) the URL, (2) the headline, and (3) the timestamp into a MySQL database table where the URL is the primary key. If the same URL is already in the database, the insert fails. Every URL is thus stored only once, and the timestamp indicates when it was first seen.

Afterwards I can extract the most recent headlines as an RSS feed, or I can select the timestamps for all headlines that contain the word Greenspan, to track how that news topic has varied over time.

What I have today is a stupid cron script. It doesn't notice that some sources are daily newspapers and others are monthly magazines. A better solution should count how many news items were found in a fetch, and adapt the fetch interval to each source. I figure this would be an economical feature of any search robot, which would then index fast-changing web pages more often than static ones.

Does Nutch have that smart adaptation of the fetch interval?

Does Nutch save the timestamp when a URL was first seen?

Does Nutch find news headlines that aren't anchor text? Some newspapers use this format, and you don't want to store "read more" as the headline:

    GREENSPAN DOES NOTHING
    Wash. DC. Today it was announced...
    <a href="article_4711.html" >read more</a>

What tools are there to extract or select data from the Nutch database, and is there a good tutorial or documentation on that, other than the source code?

Also, does Nutch record the response time and availability for each URL, and is there a way to extract this information from the database (from the command prompt)?
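
To make the scheme described above concrete, here is a minimal sketch in Python. It is not the actual cron script: sqlite3 stands in for MySQL, and the table name, function names, and the halve/double rule with its limits are all hypothetical choices made only for illustration.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the store; the URL is the primary key, so each URL is kept once."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        " url TEXT PRIMARY KEY,"
        " headline TEXT NOT NULL,"
        " first_seen INTEGER NOT NULL)"
    )
    return db

def record(db, url, headline, now):
    """Try to insert a headline; return True only if the URL was new.

    A duplicate URL violates the primary key, the insert fails, and the
    original first_seen timestamp is left untouched.
    """
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO headlines (url, headline, first_seen) VALUES (?, ?, ?)",
                (url, headline, now),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def timestamps_for(db, word):
    """First-seen timestamps of all headlines containing the given word."""
    rows = db.execute(
        "SELECT first_seen FROM headlines WHERE headline LIKE ? ORDER BY first_seen",
        ("%" + word + "%",),
    )
    return [row[0] for row in rows]

def next_interval(current, new_items, lo=3600, hi=30 * 86400):
    """Assumed adaptation rule: halve the fetch interval (in seconds) when a
    fetch found new items, double it when it found none, and clamp the
    result to the range [lo, hi]."""
    proposed = current // 2 if new_items > 0 else current * 2
    return max(lo, min(hi, proposed))
```

With a rule like this, a daily newspaper would drift toward the lower limit and a monthly magazine toward the upper one, simply by passing the number of newly recorded URLs from each fetch to next_interval.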
--
Lars Aronsson ([EMAIL PROTECTED])
Aronsson Datateknik - http://aronsson.se/
