Lars Aronsson wrote:
Does Nutch have that smart adaptation of the fetch interval?

Not yet. We do store a separate fetch interval for each page, but currently it is only ever set to the default.
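
A minimal sketch of how a per-page interval could be adapted, assuming the fetcher can tell whether a page changed since the last fetch. Nothing here exists in Nutch: the class name, the use of days as the unit, and the bounds and factors are all hypothetical, chosen only for illustration.

```java
// Hypothetical sketch, not part of Nutch. All names and constants are assumptions.
final class AdaptiveFetchInterval {
    static final float MIN_DAYS = 1.0f;   // assumed lower bound on the interval
    static final float MAX_DAYS = 90.0f;  // assumed upper bound on the interval

    private AdaptiveFetchInterval() {}

    /**
     * Returns the next fetch interval in days. The interval is halved when
     * the page changed since the last fetch and grown by 50% when it did
     * not, then clamped to [MIN_DAYS, MAX_DAYS].
     */
    static float next(float currentDays, boolean changed) {
        float proposed = changed ? currentDays / 2.0f : currentDays * 1.5f;
        return Math.max(MIN_DAYS, Math.min(MAX_DAYS, proposed));
    }
}
```

With these assumed factors, a page on a 30-day interval would move to 15 days after a detected change and to 45 days after an unchanged fetch.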


Does Nutch save the timestamp when a URL was first seen?

No, although that might be a good addition. It would simply be a new field in Page.
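
Roughly, such a field might behave as in the sketch below. This is a hypothetical stand-in record, not Nutch's actual Page class; the field names and the use of epoch milliseconds are assumptions. The point is only that the first-seen time is set once at discovery and survives later fetches and re-discoveries.

```java
// Hypothetical stand-in for a page record. This is NOT Nutch's Page class.
final class PageRecord {
    private final String url;
    private final long firstSeenMillis; // set once, when the URL is first discovered
    private long lastFetchedMillis;     // 0 until the page has been fetched

    PageRecord(String url, long firstSeenMillis) {
        this.url = url;
        this.firstSeenMillis = firstSeenMillis;
        this.lastFetchedMillis = 0L;
    }

    String getUrl() { return url; }
    long getFirstSeenMillis() { return firstSeenMillis; }
    long getLastFetchedMillis() { return lastFetchedMillis; }

    /** Records a fetch; the first-seen time is deliberately left untouched. */
    void markFetched(long whenMillis) {
        this.lastFetchedMillis = whenMillis;
    }

    /**
     * Resolves a re-discovery of the same URL: returns whichever record
     * carries the earlier first-seen time.
     */
    static PageRecord keepEarliest(PageRecord existing, PageRecord rediscovered) {
        return existing.firstSeenMillis <= rediscovered.firstSeenMillis
                ? existing : rediscovered;
    }
}
```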


Does Nutch find news headlines that aren't anchor text?  Some
newspapers use this format, and you don't want to store "read more"
as the anchor text:

   GREENSPAN DOES NOTHING
   Wash. DC.  Today it was announced...
   <a href="article_4711.html" >read more</a>

Nutch does not yet have heuristics to grab better anchor text in this case.
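
One possible heuristic, sketched under assumptions: if the link label is a generic phrase such as "read more", fall back to the first non-empty line of the text block that precedes the link. The class name, the list of generic phrases, and the idea that the preceding text is already available as a string are all hypothetical; none of this is existing Nutch code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not part of Nutch.
final class AnchorTextHeuristic {
    // Assumed list of generic link labels; a real list would need tuning.
    private static final Set<String> GENERIC = new HashSet<>(Arrays.asList(
            "read more", "more", "click here", "full story"));

    private AnchorTextHeuristic() {}

    /**
     * Returns the anchor text unless it is a generic label, in which case
     * the first non-empty line of the preceding text is used instead.
     * Falls back to the original anchor text if no such line exists.
     */
    static String choose(String anchorText, String precedingText) {
        String anchor = anchorText.trim();
        if (!GENERIC.contains(anchor.toLowerCase(Locale.ROOT))) {
            return anchor;
        }
        for (String line : precedingText.split("\\r?\\n")) {
            String candidate = line.trim();
            if (!candidate.isEmpty()) {
                return candidate;
            }
        }
        return anchor;
    }
}
```

For the example above, the preceding block starts with the headline line, so that line would be chosen over "read more".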

What tools are there to extract or select the data out of the
Nutch database, and is there some good tutorial or documentation on
that, except the source code?

The documentation is not great. The tool that selects pages for fetch is FetchListTool.java.


Also, does Nutch record the response time and availability for each
URL, and is there a way to extract this information from the database
(from the command prompt)?

We do not yet store such information in the database.

Doug



_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general