[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil closed NUTCH-1329.
------------------------------
Resolution: Cannot Reproduce
Closing for now and marking it as "Cannot Reproduce".
> parser not extract outlinks to external web sites
> -------------------------------------------------
>
> Key: NUTCH-1329
> URL: https://issues.apache.org/jira/browse/NUTCH-1329
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: behnam nikbakht
> Labels: parse
> Fix For: 2.3, 1.8
>
>
> I found a bug in
> /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
> an outlink such as www.example2.com found on www.example1.com is inserted as
> www.example1.com/www.example2.com.
> I corrected this by testing whether the outlink (www.example2.com) is itself a
> valid URL; if it is, it is used as-is, otherwise it is resolved against its base URL.
> So I replaced these lines:
>   URL url = URLUtil.resolveURL(base, target);
>   outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
> with:
>   // Try to parse the target as an absolute URL; this fails for relative links.
>   String host_temp = null;
>   try {
>     host_temp = URLUtil.getDomainName(new URL(target));
>   } catch (Exception e) {
>     host_temp = null;
>   }
>   URL url = null;
>   if (host_temp == null) {
>     // it is an internal outlink: resolve it against the base URL
>     url = URLUtil.resolveURL(base, target);
>   } else {
>     // it is an external link: use the target as-is
>     url = new URL(target);
>   }
>   outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
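
The behaviour described above can be sketched outside Nutch with plain java.net.URL. This is only an illustration, not the Nutch code: the class OutlinkResolution and its resolve method are hypothetical names, and new URL(base, target) stands in for URLUtil.resolveURL on the assumption that both resolve relative references in the usual way. It suggests one possible reason the report could not be reproduced: a target that carries a scheme is already kept as an absolute URL, while a bare www.example2.com has no scheme, so it is treated as a relative path both by base resolution and by the absolute-first check.

```java
import java.net.MalformedURLException;
import java.net.URL;

class OutlinkResolution {
    // Hypothetical helper: use the target directly when it parses as an
    // absolute URL, otherwise resolve it against the base URL.
    static URL resolve(URL base, String target) throws MalformedURLException {
        try {
            return new URL(target);        // succeeds only if target has a scheme
        } catch (MalformedURLException e) {
            return new URL(base, target);  // relative reference: resolve against base
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.example1.com/");

        // Absolute link with a scheme: kept as an external outlink.
        System.out.println(resolve(base, "http://www.example2.com/"));
        // prints http://www.example2.com/

        // Bare host without a scheme: indistinguishable from a relative path.
        System.out.println(resolve(base, "www.example2.com"));
        // prints http://www.example1.com/www.example2.com

        // Ordinary internal link.
        System.out.println(resolve(base, "/about.html"));
        // prints http://www.example1.com/about.html
    }
}
```

In this sketch the second case still yields www.example1.com/www.example2.com, because a scheme-less href is a relative reference under standard URL resolution rules; treating it as external would need an extra heuristic that is not part of the patch quoted above.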