[jira] [Updated] (NUTCH-649) Log list of files found but not crawled.

Lewis John McGibbney (JIRA) Sat, 12 Jan 2013 11:24:15 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lewis John McGibbney updated NUTCH-649:
---------------------------------------

    Fix Version/s: 1.7
    
> Log list of files found but not crawled.
> ----------------------------------------
>
>                 Key: NUTCH-649
>                 URL: https://issues.apache.org/jira/browse/NUTCH-649
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>         Environment: any
>            Reporter: Jim
>             Fix For: 1.7
>
>
>         I use Nutch to find the location of executables on the web, but we do 
> not download the executables with Nutch.  To get Nutch to report the 
> location of files without downloading them, I had to make a very small 
> patch to the code, and I think this change might be useful to others too.  
> The patch just logs files that are being filtered at the info level, although 
> perhaps it should be at the debug level.
>    I have included an svn diff with this change.  It can serve both as a 
> diagnostic tool (let's see what we are skipping) and as a way to 
> find content and links pointed to by a page or site without having to 
> actually download that content.
> Index: ParseOutputFormat.java
> ===================================================================
> --- ParseOutputFormat.java      (revision 593619)
> +++ ParseOutputFormat.java      (working copy)
> @@ -193,17 +193,20 @@
>                 toHost = null;
>               }
>               if (toHost == null || !toHost.equals(fromHost)) { // external links
> +               LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
> +
>                 continue; // skip it
>               }
>             }
>             try {
>               toUrl = normalizers.normalize(toUrl,
>                           URLNormalizers.SCOPE_OUTLINK); // normalize the url
> -              toUrl = filters.filter(toUrl);   // filter the url
> -              if (toUrl == null) {
> -                continue;
> -              }
> -            } catch (Exception e) {
> +
> +             if (filters.filter(toUrl) == null) {   // filter the url
> +                     LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
> +                     continue;
> +                 }
> +           } catch (Exception e) {
>               continue;
>             }
>             CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
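
For readers who want to try the idea outside of Nutch, here is a minimal standalone sketch of the behaviour the patch describes: outlinks that are skipped (external host, or rejected by a URL filter) are logged together with the page that linked to them. Everything in it is an assumption made for illustration: the class `OutlinkFilterSketch`, the methods `hostOf` and `keepOutlinks`, the `skipped` list, and the use of `java.util.logging` are hypothetical stand-ins and are not part of `ParseOutputFormat` or any Nutch API.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.logging.Logger;

// Hypothetical stand-in for the outlink loop touched by the patch; not Nutch code.
class OutlinkFilterSketch {
  private static final Logger LOG = Logger.getLogger("OutlinkFilterSketch");

  /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
  static String hostOf(String url) {
    try {
      String host = URI.create(url).getHost();
      return host == null ? null : host.toLowerCase();
    } catch (IllegalArgumentException e) {
      return null;
    }
  }

  /**
   * Returns the outlinks that survive filtering. Each skipped URL is logged
   * at info level and also appended to 'skipped' so callers can inspect it.
   * 'filter' is assumed to return null for a rejected URL.
   */
  static List<String> keepOutlinks(String fromUrl, List<String> toUrls,
      boolean ignoreExternalLinks, UnaryOperator<String> filter,
      List<String> skipped) {
    String fromHost = hostOf(fromUrl);
    List<String> kept = new ArrayList<>();
    for (String toUrl : toUrls) {
      if (ignoreExternalLinks) {
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) { // external link
          LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
          skipped.add(toUrl);
          continue; // skip it
        }
      }
      // Keep the filter result separate so the original URL can still be logged.
      String filtered = filter.apply(toUrl);
      if (filtered == null) {
        LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
        skipped.add(toUrl);
        continue;
      }
      kept.add(filtered);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> skipped = new ArrayList<>();
    List<String> kept = keepOutlinks(
        "http://example.com/index.html",
        List.of("http://example.com/tool.exe", "http://example.com/page.html"),
        true,
        url -> url.endsWith(".exe") ? null : url, // toy filter: reject executables
        skipped);
    System.out.println("kept=" + kept + " skipped=" + skipped);
  }
}
```

One difference from the diff is deliberate: the patched code discards the return value of `filters.filter(toUrl)`, whereas this sketch stores it in a separate variable, so a filter that returns a modified URL would still take effect while the original URL remains available for the log message. Whether that matters for a given filter chain is an open question here, not something the issue states.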

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
