ParseOutputFormat has redundant code
------------------------------------
Key: NUTCH-1212
URL: https://issues.apache.org/jira/browse/NUTCH-1212
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.4
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.5
In ParseOutputFormat, I see a code block:
{code}
// collect outlinks for subsequent db update
Outlink[] links = parseData.getOutlinks();
int outlinksToStore = Math.min(maxOutlinks, links.length);
if (ignoreExternalLinks) {
  try {
    fromHost = new URL(fromUrl).getHost().toLowerCase();
  } catch (MalformedURLException e) {
    fromHost = null;
  }
} else {
  fromHost = null;
}
{code}
The same if (ignoreExternalLinks) check is then repeated in the ensuing
for loop, where the host is set and reset on each iteration:
{code}
int validCount = 0;
CrawlDatum adjust = null;
List<Entry<Text, CrawlDatum>> targets =
    new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
  String toUrl = links[i].getToUrl();
  // ignore links to self (or anchors within the page)
  if (fromUrl.equals(toUrl)) {
    continue;
  }
  if (ignoreExternalLinks) {
    try {
      toHost = new URL(toUrl).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      toHost = null;
    }
    if (toHost == null || !toHost.equals(fromHost)) { // external links
      continue; // skip it
    }
  }
{code}
Isn't that redundant? I don't think the first if block is needed.
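
For discussion, here is a minimal standalone sketch of how the two try/catch host lookups quoted above could share one helper. It is not the actual ParseOutputFormat code: the class name OutlinkHostFilter and the methods hostOf and filter are hypothetical, plain strings stand in for Outlink objects, and the maxOutlinks cap is left out.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
class OutlinkHostFilter {

  /** Returns the lower-cased host of url, or null if the URL is malformed. */
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  /** Returns the target URLs that survive the self-link and external-link checks. */
  static List<String> filter(String fromUrl, String[] toUrls,
                             boolean ignoreExternalLinks) {
    // Source host is looked up once, and only when it will be compared.
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;
    List<String> kept = new ArrayList<String>();
    for (String toUrl : toUrls) {
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) { // external link
          continue; // skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }
}
```

With the helper in place, each code path contains a single host lookup, which may make it easier to judge which of the two blocks, if either, can be dropped.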