[ https://issues.apache.org/jira/browse/NUTCH-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1212.
----------------------------------
Resolution: Fixed
This is fixed in NUTCH-1184
> ParseOutputFormat has redundant code
> ------------------------------------
>
> Key: NUTCH-1212
> URL: https://issues.apache.org/jira/browse/NUTCH-1212
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.4
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.5
>
>
> In ParseOutputFormat, I see a code block:
> {code}
> // collect outlinks for subsequent db update
> Outlink[] links = parseData.getOutlinks();
> int outlinksToStore = Math.min(maxOutlinks, links.length);
> if (ignoreExternalLinks) {
>   try {
>     fromHost = new URL(fromUrl).getHost().toLowerCase();
>   } catch (MalformedURLException e) {
>     fromHost = null;
>   }
> } else {
>   fromHost = null;
> }
> {code}
> The if (ignoreExternalLinks) check is then repeated, with the host
> set and reset, in the ensuing for loop:
> {code}
> int validCount = 0;
> CrawlDatum adjust = null;
> List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
> List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
> for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
>   String toUrl = links[i].getToUrl();
>   // ignore links to self (or anchors within the page)
>   if (fromUrl.equals(toUrl)) {
>     continue;
>   }
>   if (ignoreExternalLinks) {
>     try {
>       toHost = new URL(toUrl).getHost().toLowerCase();
>     } catch (MalformedURLException e) {
>       toHost = null;
>     }
>     if (toHost == null || !toHost.equals(fromHost)) { // external links
>       continue; // skip it
>     }
>   }
> {code}
> Isn't that redundant? I don't think the first if block is needed.
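
For reference, the two quoted excerpts can be read together as one host-comparison filter. The following is only a minimal sketch of that logic, not the actual ParseOutputFormat code and not the change made in NUTCH-1184. It assumes plain String URLs in place of Nutch's Outlink and CrawlDatum types, and the class name OutlinkHostFilter and the helpers hostOf and filter are hypothetical names introduced for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the quoted logic; not part of Nutch.
class OutlinkHostFilter {

  // Lower-cased host of a URL, or null if the URL is malformed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Keeps at most min(maxOutlinks, toUrls.size()) links, skipping self
  // links and, when ignoreExternalLinks is set, links to other hosts.
  static List<String> filter(String fromUrl, List<String> toUrls,
      boolean ignoreExternalLinks, int maxOutlinks) {
    int outlinksToStore = Math.min(maxOutlinks, toUrls.size());

    // Source host: derived once, before the loop (first excerpt).
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;

    List<String> kept = new ArrayList<String>(outlinksToStore);
    for (int i = 0; i < toUrls.size() && kept.size() < outlinksToStore; i++) {
      String toUrl = toUrls.get(i);
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        // Target host: derived per link, inside the loop (second excerpt).
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) {
          continue; // external or malformed link, skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }
}
```

In this sketch, calling OutlinkHostFilter.filter(fromUrl, toUrls, true, maxOutlinks) keeps only same-host links, while passing false keeps every link except links to the page itself.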