The "DebugTool" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/DebugTool?action=diff&rev1=4&rev2=5

  It should be possible to generate information that would have answered all of 
the "is it X" questions that came up during a user's crawl. E.g.
  
   1. which URLs were put on the fetch list, versus skipped.
-  1. which fetched documents were truncated.
+  1. which fetched documents were truncated. 
+     - The code currently has only primitive logging: all parse plugins log 
the result of their truncation check to stdout. What more could we do here? 
Truncation is a common problem, so it would be good to improve this area.
   1. which URLs in a parsed page were skipped, due to the max outlinks per 
page limit.
   1. which URLs got filtered by regex, prefix, suffix, domain filters
   1. exclusions by robots directives
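
One way to picture the kind of information listed above is a small per-crawl
log that records, for each URL, which decision was taken and why. The sketch
below is illustrative only: the class CrawlDecisionLog, the Reason enum and
the looksTruncated helper are hypothetical names, not part of Nutch, and the
truncation check assumes that a declared content length is available to
compare against the number of bytes actually stored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Nutch.
class CrawlDecisionLog {

    // One reason per "is it X" question in the list above.
    enum Reason {
        NOT_ON_FETCH_LIST,
        TRUNCATED,
        OUTLINK_LIMIT_EXCEEDED,
        URL_FILTER_REJECTED,
        ROBOTS_EXCLUDED
    }

    // A single recorded decision: the URL, the reason, and a free-form detail.
    static final class Entry {
        final String url;
        final Reason reason;
        final String detail;

        Entry(String url, Reason reason, String detail) {
            this.url = url;
            this.reason = reason;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return reason + "\t" + url + "\t" + detail;
        }
    }

    // Entries grouped by reason so each question can be answered directly.
    private final Map<Reason, List<Entry>> entries = new EnumMap<>(Reason.class);

    // Record that a URL was skipped, truncated, or excluded for the given reason.
    void record(String url, Reason reason, String detail) {
        entries.computeIfAbsent(reason, r -> new ArrayList<>())
               .add(new Entry(url, reason, detail));
    }

    // Return all entries recorded for one reason (empty if none).
    List<Entry> byReason(Reason reason) {
        return entries.getOrDefault(reason, Collections.emptyList());
    }

    // Assumed heuristic: content looks truncated when a length was declared
    // (non-negative) and fewer bytes than that were actually stored.
    static boolean looksTruncated(long declaredLength, long storedLength) {
        return declaredLength >= 0 && storedLength < declaredLength;
    }
}
```

A caller would invoke record(...) at each point where a URL is dropped or a
document is cut short, then print byReason(Reason.TRUNCATED) (or any other
reason) at the end of the crawl instead of searching through stdout.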
