[ https://issues.apache.org/jira/browse/NUTCH-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1212.
----------------------------------

    Resolution: Fixed

This is fixed in NUTCH-1184.
                
> ParseOutputFormat has redundant code
> ------------------------------------
>
>                 Key: NUTCH-1212
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1212
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.5
>
>
> In ParseOutputFormat, I see a code block:
> {code}
>          // collect outlinks for subsequent db update
>          Outlink[] links = parseData.getOutlinks();
>          int outlinksToStore = Math.min(maxOutlinks, links.length);
>          if (ignoreExternalLinks) {
>            try {
>              fromHost = new URL(fromUrl).getHost().toLowerCase();
>            } catch (MalformedURLException e) {
>              fromHost = null;
>            }
>          } else {
>            fromHost = null;
>          }
> {code}
> The same if (ignoreExternalLinks) check is then repeated, with the host
> being set and reset, in the ensuing for loop:
> {code}
>          int validCount = 0;
>          CrawlDatum adjust = null;
>          List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
>          List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
>          for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
>            String toUrl = links[i].getToUrl();
>            // ignore links to self (or anchors within the page)
>            if (fromUrl.equals(toUrl)) {
>              continue;
>            }
>            if (ignoreExternalLinks) {
>              try {
>                toHost = new URL(toUrl).getHost().toLowerCase();
>              } catch (MalformedURLException e) {
>                toHost = null;
>              }
>              if (toHost == null || !toHost.equals(fromHost)) { // external links
>                continue; // skip it
>              }
>            }
> {code}
> Isn't that redundant? I don't think the first if block is needed.
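
To make the interaction between the two quoted blocks easier to follow, here is a minimal, self-contained sketch of the host comparison they perform together. It is not Nutch code: the class name ExternalLinkFilterSketch, the helpers hostOf and keepInternal, and the sample URLs are hypothetical, and the outlink limit and CrawlDatum handling are left out. It assumes only what the quoted snippets show, namely that the source host is derived once from fromUrl and each target host is derived inside the loop.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class ExternalLinkFilterSketch {

  // Hypothetical helper: lower-cased host of a URL, or null if it is malformed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Hypothetical helper: mirrors the filtering in the quoted loop, without
  // the maxOutlinks limit or any CrawlDatum bookkeeping.
  static List<String> keepInternal(String fromUrl, String[] toUrls,
                                   boolean ignoreExternalLinks) {
    // Computed once, before the loop (the first quoted block).
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;

    List<String> kept = new ArrayList<>();
    for (String toUrl : toUrls) {
      // Ignore links to self.
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        // Computed per link (the second quoted block).
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) {
          continue; // external or unparsable link: skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] links = {
      "http://example.org/a",
      "http://example.org/b",
      "http://other.example.com/c",
      "not a url"
    };
    System.out.println(keepInternal("http://example.org/a", links, true));
    System.out.println(keepInternal("http://example.org/a", links, false));
  }
}
```

In this sketch the per-link check reads fromHost but never assigns it, so whether the first block can be dropped depends on where else fromHost is set in the real class.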
