[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1329:
----------------------------------------

    Fix Version/s: 2.2
                   1.7
    
> parser not extract outlinks to external web sites
> -------------------------------------------------
>
>                 Key: NUTCH-1329
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1329
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>              Labels: parse
>             Fix For: 1.7, 2.2
>
>
> I found a bug in 
> /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
> an outlink such as www.example2.com found on www.example1.com is recorded as 
> www.example1.com/www.example2.com.
> I corrected this by testing whether the outlink (www.example2.com) is a valid 
> URL on its own; if it is not, it is resolved against its base URL.
> To do this, I replaced these lines:
>                 URL url = URLUtil.resolveURL(base, target);
>                 outlinks.add(new Outlink(url.toString(),
>                                          linkText.toString().trim()));
> with:
>                 // Try to parse the target as an absolute URL in its own right.
>                 String host_temp = null;
>                 try {
>                     host_temp = URLUtil.getDomainName(new URL(target));
>                 } catch (Exception e) {
>                     host_temp = null; // target is not an absolute URL
>                 }
>                 URL url = null;
>                 if (host_temp == null) { // internal outlink: resolve against the base
>                     url = URLUtil.resolveURL(base, target);
>                 } else { // external outlink: use the target as-is
>                     url = new URL(target);
>                 }
>                 outlinks.add(new Outlink(url.toString(),
>                                          linkText.toString().trim()));
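
The check described above can be tried outside Nutch with a minimal sketch that uses only java.net.URL. The class name OutlinkResolutionSketch and its two methods are hypothetical and are not part of DOMContentUtils or URLUtil; the sketch only illustrates how the JDK treats absolute and scheme-less targets, on the assumption that this is the behaviour the report is concerned with.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; not Nutch code.
class OutlinkResolutionSketch {

    /** True if the target parses as an absolute URL on its own (it carries a known scheme). */
    static boolean isAbsolute(String target) {
        try {
            new URL(target);
            return true;
        } catch (MalformedURLException e) {
            return false; // e.g. "www.example2.com" has no protocol
        }
    }

    /** Uses an absolute target as-is; otherwise resolves it against the base URL. */
    static String resolve(String base, String target) {
        try {
            URL baseUrl = new URL(base);
            URL resolved = isAbsolute(target) ? new URL(target) : new URL(baseUrl, target);
            return resolved.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot resolve " + target + " against " + base, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/index.html";
        System.out.println(resolve(base, "http://www.example2.com/")); // absolute target
        System.out.println(resolve(base, "about.html"));               // relative target
        System.out.println(resolve(base, "www.example2.com"));         // scheme-less target
    }
}
```

With plain java.net.URL, a target without a scheme such as www.example2.com is syntactically a relative path, so the last call yields http://www.example1.com/www.example2.com. Only a target that carries its own scheme is kept as an external link by a check of this kind.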

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
