> ...after I get back from Thanksgiving dinner :-)
> 
> 1. In URLFilterChecker, the cmd line tool requires URLs to be fed into it
> on STDIN, but that isn't documented anywhere, even in the tool help
> printed to STDOUT. I'll fix that.
> 
> 2. In ParseOutputFormat, I see a code block:
> 
> {code}
>           // collect outlinks for subsequent db update
>           Outlink[] links = parseData.getOutlinks();
>           int outlinksToStore = Math.min(maxOutlinks, links.length);
>           if (ignoreExternalLinks) {
>             try {
>               fromHost = new URL(fromUrl).getHost().toLowerCase();
>             } catch (MalformedURLException e) {
>               fromHost = null;
>             }
>           } else {
>             fromHost = null;
>           }
> {code}
> 
> The if(ignoreExternalLinks) part then gets set and reset in the
> ensuing for loop:
> 
> {code}
>           int validCount = 0;
>           CrawlDatum adjust = null;
>           List<Entry<Text, CrawlDatum>> targets =
>               new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
>           List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
>           for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
>             String toUrl = links[i].getToUrl();
>             // ignore links to self (or anchors within the page)
>             if (fromUrl.equals(toUrl)) {
>               continue;
>             }
>             if (ignoreExternalLinks) {
>               try {
>                 toHost = new URL(toUrl).getHost().toLowerCase();
>               } catch (MalformedURLException e) {
>                 toHost = null;
>               }
>               if (toHost == null || !toHost.equals(fromHost)) { // external links
>                 continue; // skip it
>               }
>             }
> {code}
> 
> So, what's the point of that initial if(...) block outside of the for loop?
> Isn't it redundant?

Is this trunk? I have been, and still am, working on some issues for a new
feature in this part of that source file:

https://issues.apache.org/jira/browse/NUTCH-1184
https://issues.apache.org/jira/browse/NUTCH-1174
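
On the redundancy question: going only by the two snippets quoted above, the
block before the loop sets fromHost while the block inside the loop sets
toHost, so the outer one looks like a loop-invariant computed once rather than
a duplicate. Below is a minimal, self-contained sketch of that pattern. The
class OutlinkHostFilter and the helpers hostOf and filter are hypothetical
names made up for illustration; they are not part of ParseOutputFormat, and
plain strings stand in for Outlink objects.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Nutch code.
class OutlinkHostFilter {

  // Returns the lower-cased host of a URL, or null if the URL is malformed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Keeps at most maxOutlinks links, skipping links to self and, when
  // ignoreExternalLinks is set, links whose host differs from the source host.
  static List<String> filter(String fromUrl, String[] toUrls,
                             boolean ignoreExternalLinks, int maxOutlinks) {
    // Loop-invariant: the source host is resolved once, before the loop.
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;

    int outlinksToStore = Math.min(maxOutlinks, toUrls.length);
    List<String> kept = new ArrayList<String>(outlinksToStore);
    for (int i = 0; i < toUrls.length && kept.size() < outlinksToStore; i++) {
      String toUrl = toUrls[i];
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        // Per-link work: only the target host changes on each iteration.
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) {
          continue; // external or unparsable link, skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }
}
```

If that reading is right, dropping the outer block would leave fromHost unset
when the loop compares against it, so moving it inside the loop would only
repeat the same URL parse for every outlink.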

> 
> If so, I'll file an issue and fix that.
> 
> Cheers,
> Chris
> 
> P.S. Happy Thanksgiving to Nutch'ers in the US!
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
