> ...after I get back from Thanksgiving dinner :-)
>
> 1. In URLFilterChecker, the command-line tool requires URLs to be fed
> into it on STDIN, but that isn't documented anywhere, not even in the
> tool help printed to STDOUT. I'll fix that.
>
> 2. In ParseOutputFormat, I see a code block:
>
> {code}
> // collect outlinks for subsequent db update
> Outlink[] links = parseData.getOutlinks();
> int outlinksToStore = Math.min(maxOutlinks, links.length);
> if (ignoreExternalLinks) {
> try {
> fromHost = new URL(fromUrl).getHost().toLowerCase();
> } catch (MalformedURLException e) {
> fromHost = null;
> }
> } else {
> fromHost = null;
> }
> {code}
>
> The if (ignoreExternalLinks) check then appears again inside the
> ensuing for loop:
>
> {code}
> int validCount = 0;
> CrawlDatum adjust = null;
> List<Entry<Text, CrawlDatum>> targets =
> new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
> List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
> for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
> String toUrl = links[i].getToUrl();
> // ignore links to self (or anchors within the page)
> if (fromUrl.equals(toUrl)) {
> continue;
> }
> if (ignoreExternalLinks) {
> try {
> toHost = new URL(toUrl).getHost().toLowerCase();
> } catch (MalformedURLException e) {
> toHost = null;
> }
> if (toHost == null || !toHost.equals(fromHost)) { // external links
> continue; // skip it
> }
> }
> {code}
>
> So what's the point of that initial if (...) block outside of the for
> loop? Isn't it redundant?
Is this trunk? I have been, and still am, working on some issues for a new
feature in this part of that source file:
https://issues.apache.org/jira/browse/NUTCH-1184
https://issues.apache.org/jira/browse/NUTCH-1174
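
For reference, here is a minimal standalone sketch of the two host lookups as
they appear in the quoted snippets. It is not the actual ParseOutputFormat
code: the class OutlinkHostFilter and the methods hostOf and keepInternal are
hypothetical names made up for illustration, and plain strings stand in for
Outlink. As far as the quoted code shows, the block before the loop resolves
fromHost once, while the block inside the loop resolves toHost for each link.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class OutlinkHostFilter {

  // Returns the lower-cased host of a URL, or null if the URL cannot be parsed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Keeps only the links that point to the same host as fromUrl.
  static List<String> keepInternal(String fromUrl, List<String> toUrls) {
    // Does not depend on the loop variable, so it is resolved once up front.
    String fromHost = hostOf(fromUrl);
    List<String> kept = new ArrayList<>();
    for (String toUrl : toUrls) {
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      // Differs for every link, so it has to be resolved inside the loop.
      String toHost = hostOf(toUrl);
      if (toHost == null || !toHost.equals(fromHost)) {
        continue; // external or unparsable link: skip it
      }
      kept.add(toUrl);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://example.com/a",   // link to self
        "http://example.com/b",   // same host
        "http://other.org/x",     // external host
        "not a url");             // malformed
    System.out.println(keepInternal("http://Example.com/a", links));
  }
}
```

In this sketch only the same-host link survives the filter; the self link, the
external link, and the malformed entry are all skipped.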
>
> If so, I'll file an issue and fix that.
>
> Cheers,
> Chris
>
> P.S. Happy Thanksgiving to Nutch'ers in the US!
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++