...after I get back from Thanksgiving dinner :-)
1. In URLFilterChecker, the command-line tool requires URLs to be fed to it on
STDIN, but that isn't documented anywhere, not even in the tool help printed to
STDOUT. I'll fix that.
2. In ParseOutputFormat, I see a code block:
{code}
// collect outlinks for subsequent db update
Outlink[] links = parseData.getOutlinks();
int outlinksToStore = Math.min(maxOutlinks, links.length);
if (ignoreExternalLinks) {
  try {
    fromHost = new URL(fromUrl).getHost().toLowerCase();
  } catch (MalformedURLException e) {
    fromHost = null;
  }
} else {
  fromHost = null;
}
{code}
The if (ignoreExternalLinks) check is then repeated in the ensuing for loop,
where a host variable is set and reset on each iteration:
{code}
int validCount = 0;
CrawlDatum adjust = null;
List<Entry<Text, CrawlDatum>> targets =
    new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
  String toUrl = links[i].getToUrl();
  // ignore links to self (or anchors within the page)
  if (fromUrl.equals(toUrl)) {
    continue;
  }
  if (ignoreExternalLinks) {
    try {
      toHost = new URL(toUrl).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      toHost = null;
    }
    if (toHost == null || !toHost.equals(fromHost)) { // external links
      continue; // skip it
    }
  }
{code}
So, what's the point of that initial if(...) block outside of the for loop?
Isn't it redundant?
If so, I'll file an issue and fix that.
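
To make the question in item 2 easier to reason about, here is a minimal standalone sketch of the pattern in the two excerpts. It is not the Nutch code: the class ExternalLinkSketch and the helpers hostOf and keepInternal are hypothetical names used only for illustration, and plain strings stand in for Outlink objects. It assumes the intent is to compute the source host once before the loop and the target host once per link inside it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
class ExternalLinkSketch {

  // Returns the lower-cased host of a URL, or null if it cannot be parsed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Keeps only links whose host matches the host of fromUrl.
  static List<String> keepInternal(String fromUrl, List<String> toUrls) {
    // Computed once per page, before the loop (the role of fromHost).
    String fromHost = hostOf(fromUrl);
    List<String> kept = new ArrayList<String>();
    for (String toUrl : toUrls) {
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      // Computed once per link, inside the loop (the role of toHost).
      String toHost = hostOf(toUrl);
      if (toHost == null || !toHost.equals(fromHost)) {
        continue; // external or unparseable link: skip it
      }
      kept.add(toUrl);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://example.com/a",   // link to self
        "http://EXAMPLE.com/b",   // same host, different case
        "http://other.org/c",     // external host
        "not a url");             // no protocol, cannot be parsed
    System.out.println(keepInternal("http://example.com/a", links));
  }
}
```

In this sketch the work before the loop and the work inside it touch different variables, so whether the outer block in ParseOutputFormat is truly redundant depends on whether fromHost is used anywhere else.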
Cheers,
Chris
P.S. Happy Thanksgiving to Nutch'ers in the US!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++