[ https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-649:
---------------------------------------
Fix Version/s: 1.7
> Log list of files found but not crawled.
> ----------------------------------------
>
> Key: NUTCH-649
> URL: https://issues.apache.org/jira/browse/NUTCH-649
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Environment: any
> Reporter: Jim
> Fix For: 1.7
>
>
> I use Nutch to find the locations of executables on the web, but we do
> not download the executables with Nutch. To get Nutch to report the
> locations of files without downloading them, I had to make a very small
> patch to the code, and I think this change might be useful to others as well.
> The patch logs links that are being filtered at the info level, although
> perhaps this should be at the debug level.
> I have included an svn diff with this change. It has two use cases: as a
> diagnostic tool (to see what is being skipped), and as a way to
> find content and links pointed to by a page or site without having to
> actually download that content.
> Index: ParseOutputFormat.java
> ===================================================================
> --- ParseOutputFormat.java (revision 593619)
> +++ ParseOutputFormat.java (working copy)
> @@ -193,17 +193,20 @@
> toHost = null;
> }
> if (toHost == null || !toHost.equals(fromHost)) { // external links
> + LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
> +
> continue; // skip it
> }
> }
> try {
> toUrl = normalizers.normalize(toUrl,
> URLNormalizers.SCOPE_OUTLINK); // normalize the url
> - toUrl = filters.filter(toUrl); // filter the url
> - if (toUrl == null) {
> - continue;
> - }
> - } catch (Exception e) {
> +
> + if (filters.filter(toUrl) == null) { // filter the url
> + LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
> + continue;
> + }
> + } catch (Exception e) {
> continue;
> }
> CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
> interval);
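
The description above leaves open whether the message should be logged at info or debug level. A minimal standalone sketch of the same idea, with the level passed in as a parameter, might look like the following. It does not use Nutch's classes: FilteredLinkLogger, keepAccepted and the noExecutables filter are hypothetical names chosen for illustration, and java.util.logging stands in for whatever logger ParseOutputFormat uses. The only convention taken from the patch is that a filter returns null for a rejected URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.logging.Level;
import java.util.logging.Logger;

public class FilteredLinkLogger {
    private static final Logger LOG =
            Logger.getLogger(FilteredLinkLogger.class.getName());

    /**
     * Returns the outlinks accepted by the filter. Rejected links are
     * appended to skipped and logged at the requested level.
     * The filter is a hypothetical stand-in: it returns null to reject a URL.
     */
    public static List<String> keepAccepted(String fromUrl, List<String> outlinks,
            UnaryOperator<String> filter, Level level, List<String> skipped) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            String filtered = filter.apply(toUrl); // null means "rejected"
            if (filtered == null) {
                // Build the message only if this level is actually enabled.
                if (LOG.isLoggable(level)) {
                    LOG.log(level, "filtering content " + toUrl
                            + " linked to by " + fromUrl);
                }
                skipped.add(toUrl);
                continue; // skip it
            }
            kept.add(filtered); // keep whatever the filter returned
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative filter: reject anything ending in ".exe".
        UnaryOperator<String> noExecutables =
                url -> url.endsWith(".exe") ? null : url;
        List<String> skipped = new ArrayList<>();
        List<String> kept = keepAccepted("http://example.com/",
                List.of("http://example.com/a.html", "http://example.com/setup.exe"),
                noExecutables, Level.FINE, skipped);
        System.out.println("kept=" + kept + " skipped=" + skipped);
    }
}
```

Note that this sketch stores the filter's return value, as the removed lines of the diff did with toUrl = filters.filter(toUrl). The added lines only test the return value against null, which is equivalent only if a filter never rewrites the URLs it accepts.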