ParseOutputFormat has redundant code
------------------------------------

                 Key: NUTCH-1212
                 URL: https://issues.apache.org/jira/browse/NUTCH-1212
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
            Priority: Minor
             Fix For: 1.5


In ParseOutputFormat, I see a code block:

{code}
         // collect outlinks for subsequent db update
         Outlink[] links = parseData.getOutlinks();
         int outlinksToStore = Math.min(maxOutlinks, links.length);
         if (ignoreExternalLinks) {
           try {
             fromHost = new URL(fromUrl).getHost().toLowerCase();
           } catch (MalformedURLException e) {
             fromHost = null;
           }
         } else {
           fromHost = null;
         }
{code}

The if (ignoreExternalLinks) check is then repeated in the ensuing 
for loop, where toHost is set and reset in the same way:

{code}
         int validCount = 0;
         CrawlDatum adjust = null;
         List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
         List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
         for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
           String toUrl = links[i].getToUrl();
           // ignore links to self (or anchors within the page)
           if (fromUrl.equals(toUrl)) {
             continue;
           }
           if (ignoreExternalLinks) {
             try {
               toHost = new URL(toUrl).getHost().toLowerCase();
             } catch (MalformedURLException e) {
               toHost = null;
             }
             if (toHost == null || !toHost.equals(fromHost)) { // external links
               continue; // skip it
             }
           }
{code}

Isn't that redundant? I don't think the first if block is needed.
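
For reference, here is a minimal standalone sketch of the host comparison that the two quoted blocks perform together. It is not the actual ParseOutputFormat code: the class name ExternalLinkFilterSketch and the helpers hostOf and internalLinks are hypothetical, and it assumes ignoreExternalLinks is always on and that plain strings stand in for Outlink objects.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
public class ExternalLinkFilterSketch {

  // Returns the lowercased host of a URL, or null if the URL is malformed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Keeps only links whose host matches the host of fromUrl.
  static List<String> internalLinks(String fromUrl, String[] toUrls) {
    // Source host: depends only on fromUrl, so it is resolved once.
    String fromHost = hostOf(fromUrl);
    List<String> kept = new ArrayList<String>();
    for (String toUrl : toUrls) {
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      // Target host: resolved per link.
      String toHost = hostOf(toUrl);
      if (toHost == null || !toHost.equals(fromHost)) { // external link
        continue; // skip it
      }
      kept.add(toUrl);
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] links = {
      "http://example.com/a",   // link to self
      "http://Example.com/b",   // same host, different case
      "http://other.org/c",     // external host
      "not a url"               // malformed
    };
    System.out.println(internalLinks("http://example.com/a", links));
  }
}
```

In this sketch the source host is resolved before the loop and each target host inside it, which mirrors where fromHost and toHost are assigned in the quoted code.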


        
