[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644269#comment-13644269 ]

Tejas Patil commented on NUTCH-1314:
------------------------------------

Hi Lewis,
I tested both patches. NUTCH-1314-trunk.patch failed to compile:
{noformat}
    [javac] /home/tejas/Desktop/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:391: error: cannot find symbol
    [javac]                     fixEmbeddedParams(base, target) :  new URL(base, target);
    [javac]                     ^
    [javac]   symbol:   method fixEmbeddedParams(URL,String)
    [javac]   location: class DOMContentUtils
{noformat}

For NUTCH-1314-v2.patch:
I used [this|http://nutch.apache.org/about.html] page as input and ran 
HtmlParser on it.

Before applying the patch:
{noformat}bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home   .................
outlinks: [toUrl: file:skin/basic.css anchor: , toUrl: file:skin/screen.css 
anchor: , toUrl: file:skin/print.css anchor: , toUrl: file:skin/profile.css 
anchor: , toUrl: file:skin/getBlank.js anchor: , toUrl: file:skin/getMenu.js 
anchor: , toUrl: file:skin/fontsize.js anchor: , toUrl: file:images/favicon.ico 
anchor: , toUrl: http://www.apache.org/ anchor: Apache, toUrl: 
http://nutch.apache.org anchor: Nutch, toUrl: http://nutch.apache.org anchor: 
Home, toUrl: file:skin/breadcrumbs.js anchor: , toUrl: http://www.apache.org/ 
anchor: , toUrl: file:images/feather-small.gif anchor: , toUrl: 
http://nutch.apache.org/ anchor: , toUrl: file:images/nutch_logo_tm.gif anchor: 
, toUrl: file:index.html anchor: Main, toUrl: file:wiki.html anchor: Wiki, 
toUrl: http://issues.apache.org/jira/browse/NUTCH anchor: Jira, toUrl: 
file:index.html anchor: News, toUrl: file:credits.html anchor: Credits, toUrl: 
http://www.apache.org/foundation/thanks.html anchor: Thanks, toUrl: 
http://www.cafepress.com/nutch/ anchor: Buy Stuff, toUrl: 
http://www.apache.org/foundation/sponsorship.html anchor: Sponsorship, toUrl: 
http://www.apache.org/licenses/ anchor: License, toUrl: 
http://www.apache.org/security/ anchor: Security, toUrl: file:faq.html anchor: 
FAQ, toUrl: file:wiki.html anchor: Wiki, toUrl: file:tutorial.html anchor: 
Tutorial, toUrl: file:bot.html anchor: Robot, toUrl: 
file:apidocs-2.1/index.html anchor: API Docs (2.1), toUrl: 
file:apidocs-1.6/index.html anchor: API Docs (1.6), toUrl: 
https://builds.apache.org/job/Nutch-trunk/javadoc/ anchor: API Docs (trunk 
nightly), toUrl: https://builds.apache.org/job/Nutch-nutchgora/javadoc/ anchor: 
API Docs (2.x nightly), toUrl: file:downloads.html anchor: Download, toUrl: 
file:nightly.html anchor: Nightly builds, toUrl: file:sonar.html anchor: Sonar 
Analysis, toUrl: file:mailing_lists.html anchor: Mailing Lists, toUrl: 
file:issue_tracking.html anchor: Issue Tracking, toUrl: 
file:version_control.html anchor: Version Control, toUrl: 
file:old_downloads.html anchor: Older Downloads, toUrl: 
http://lucene.apache.org/java/ anchor: Lucene, toUrl: http://hadoop.apache.org/ 
anchor: Hadoop, toUrl: http://lucene.apache.org/solr/ anchor: Solr, toUrl: 
http://tika.apache.org/ anchor: Tika, toUrl: http://gora.apache.org anchor: 
Gora, toUrl: file:skin/images/rc-b-l-15-1body-2menu-3menu.png anchor: , toUrl: 
file:about.pdf anchor: PDF, toUrl: file:skin/images/pdfdoc.gif anchor: , toUrl: 
file:about.html#Overview anchor: Overview, toUrl: 
http://lucene.apache.org/java/ anchor: Apache Lucene, toUrl: 
http://lucene.apache.org/solr/ anchor: Apache Solr, toUrl: 
http://tika.apache.org/ anchor: Apache Tika, toUrl: http://hadoop.apache.org/ 
anchor: Hadoop cluster, toUrl: http://wiki.apache.org/nutch/ anchor: Nutch 
wiki., toUrl: http://www.apache.org/licenses/ anchor: The Apache Software 
Foundation. Apache Nutch, Nutch, Apache, the Apache feather logo, and the 
Apache Nutch project logo are trademarks of The Apache Software 
Foundation.]{noformat}

After applying the patch:
{noformat}bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home   .................
outlinks: []{noformat}

Correct me if I am wrong: this patch should only remove outlinks whose target 
URL is longer than 3000 characters. None of the outlinks listed above is 
anywhere near that long, so the patch should not have removed them.
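
To make the expected behaviour concrete, here is a minimal sketch of the kind of 
length check I had in mind. It is not the code from either patch: the class 
OutlinkLengthFilter and the method keepShortTargets are hypothetical names, and 
the limit of 3000 is simply the default mentioned in the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkLengthFilter {

    // Hypothetical helper (not from the patch): keep only the targets whose
    // length does not exceed maxLength.
    static List<String> keepShortTargets(List<String> targets, int maxLength) {
        List<String> kept = new ArrayList<String>();
        for (String target : targets) {
            if (target.length() <= maxLength) {
                kept.add(target);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build an artificial target that is longer than 3000 characters.
        StringBuilder longTarget = new StringBuilder("http://example.com/");
        while (longTarget.length() <= 3000) {
            longTarget.append('a');
        }

        // Two short targets taken from the output above, plus the long one.
        List<String> targets = Arrays.asList(
                "file:skin/basic.css",
                "http://www.apache.org/",
                longTarget.toString());

        // Assumed default limit of 3000, as stated in the issue description.
        List<String> kept = keepShortTargets(targets, 3000);
        System.out.println(kept.size() + " of " + targets.size() + " outlinks kept");
    }
}
```

With a check like this, every outlink from about.html would survive and only 
the artificially long target would be dropped, which is what I expected to see 
after applying the patch.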
                
> Impose a limit on the length of outlink target urls
> ---------------------------------------------------
>
>                 Key: NUTCH-1314
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1314
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 1.7, 2.2
>
>         Attachments: NUTCH-1314.patch, NUTCH-1314-trunk.patch, 
> NUTCH-1314-v2.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused tasks to stall. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. 
> The limit is configurable, with a default of 3000. This should be long 
> enough for most uses, but sufficiently strict to make sure the regex 
> plugins do not choke on urls that are too long. Please see the attached patch 
> for the Nutchgora implementation.
> I'd like to hear what you think about this.

