parser does not extract outlinks to external web sites
------------------------------------------------------

                 Key: NUTCH-1329
                 URL: https://issues.apache.org/jira/browse/NUTCH-1329
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.4
            Reporter: behnam nikbakht


I found a bug in 
/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
outlinks such as www.example2.com found on www.example1.com are inserted as 
www.example1.com/www.example2.com.
I corrected this by testing whether the outlink target (www.example2.com) is 
itself a valid URL. If it is, it is used as-is; otherwise it is resolved 
against its base URL.
To do so, I replaced these lines:
                URL url = URLUtil.resolveURL(base, target);
                outlinks.add(new Outlink(url.toString(),
                                         linkText.toString().trim()));

with:
                // If the target parses as an absolute URL with a domain name,
                // treat it as an external link; otherwise resolve it against base.
                String targetDomain = null;
                try {
                    targetDomain = URLUtil.getDomainName(new URL(target));
                } catch (Exception e) {
                    targetDomain = null;
                }
                URL url;
                if (targetDomain == null) { // internal (relative) outlink
                    url = URLUtil.resolveURL(base, target);
                } else { // external (absolute) outlink
                    url = new URL(target);
                }
                outlinks.add(new Outlink(url.toString(),
                                         linkText.toString().trim()));
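
The check above can be tried outside Nutch with a minimal, self-contained 
sketch. It is only an illustration under stated assumptions: the class and 
method names (OutlinkTargetSketch, isAbsoluteWithHost, toOutlinkUrl) are 
hypothetical, a plain host check stands in for URLUtil.getDomainName, and 
new URL(base, target) stands in for URLUtil.resolveURL. It suggests that a 
target is used as-is only when it parses as an absolute URL on its own, so a 
target written without a scheme would still be resolved against the base.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkTargetSketch {

    /** True if target parses on its own as an absolute URL with a host. */
    static boolean isAbsoluteWithHost(String target) {
        try {
            String host = new URL(target).getHost();
            return host != null && !host.isEmpty();
        } catch (MalformedURLException e) {
            // No scheme (or an unknown one): not an absolute URL.
            return false;
        }
    }

    /** Assumed stand-in for the patched logic: absolute targets are kept,
     *  everything else is resolved against the base URL. */
    static URL toOutlinkUrl(URL base, String target) throws MalformedURLException {
        if (isAbsoluteWithHost(target)) {
            return new URL(target);       // external link
        }
        return new URL(base, target);     // internal (relative) link
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.example1.com/");
        // prints http://www.example2.com/
        System.out.println(toOutlinkUrl(base, "http://www.example2.com/"));
        // prints http://www.example1.com/about.html
        System.out.println(toOutlinkUrl(base, "/about.html"));
        // prints http://www.example1.com/www.example2.com
        System.out.println(toOutlinkUrl(base, "www.example2.com"));
    }
}
```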

