parser does not extract outlinks to external web sites
------------------------------------------------------
Key: NUTCH-1329
URL: https://issues.apache.org/jira/browse/NUTCH-1329
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
I found a bug in
/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
an outlink such as www.example2.com found on a page of www.example1.com is
recorded as www.example1.com/www.example2.com.
I corrected this by first testing whether the outlink target (www.example2.com)
is a valid URL on its own; only if it is not is it resolved against the base URL.
To do this, I replaced these lines:
URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
with:
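// Try to parse the target on its own; new URL(target) throws if the
// target is not an absolute URL, which leaves host_temp as null.
String host_temp = null;
try {
  host_temp = URLUtil.getDomainName(new URL(target));
} catch (Exception e) {
  host_temp = null;
}
URL url = null;
if (host_temp == null) {
  // internal outlink: resolve it against the base URL
  url = URLUtil.resolveURL(base, target);
} else {
  // external outlink: use the target as it is
  url = new URL(target);
}
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));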
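
The check in the replacement code can be tried outside Nutch. The following is a minimal sketch, not the Nutch implementation: it assumes plain java.net.URL in place of URLUtil (new URL(base, target) stands in for URLUtil.resolveURL, and the getDomainName test is reduced to "does the target parse as an absolute URL"). The class name OutlinkResolveSketch and the method resolveOutlink are hypothetical. In this sketch a target written without a scheme, such as www.example2.com, does not parse as an absolute URL with java.net.URL, so it is still resolved against the base.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-alone sketch; it does not use any Nutch classes.
public class OutlinkResolveSketch {

  /**
   * Returns the target unchanged when it parses as an absolute URL,
   * otherwise resolves it against the base URL.
   */
  static URL resolveOutlink(URL base, String target) throws MalformedURLException {
    try {
      // Succeeds only when the target carries a known scheme (e.g. http://).
      return new URL(target);
    } catch (MalformedURLException e) {
      // No usable scheme: treat the target as relative to the base page.
      return new URL(base, target);
    }
  }

  public static void main(String[] args) throws MalformedURLException {
    URL base = new URL("http://www.example1.com/index.html");

    // Absolute external link: kept as it is.
    System.out.println(resolveOutlink(base, "http://www.example2.com/"));

    // Relative internal link: resolved against the base.
    System.out.println(resolveOutlink(base, "about.html"));

    // Host name without a scheme: not absolute, so resolved against the base.
    System.out.println(resolveOutlink(base, "www.example2.com"));
  }
}
```

Under these assumptions only targets that include a scheme are treated as external; handling scheme-less host names would need an additional rule, which this sketch does not attempt.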