[
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644269#comment-13644269
]
Tejas Patil commented on NUTCH-1314:
------------------------------------
Hi Lewis,
I tried to test both the patches. NUTCH-1314-trunk.patch gave compilation
errors:
{noformat}
[javac] /home/tejas/Desktop/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:391: error: cannot find symbol
[javac] fixEmbeddedParams(base, target) : new URL(base, target);
[javac] ^
[javac] symbol: method fixEmbeddedParams(URL,String)
[javac] location: class DOMContentUtils
{noformat}
For NUTCH-1314-v2.patch:
I used [this|http://nutch.apache.org/about.html] URL and ran HtmlParser on it.
Before applying the patch:
{noformat}bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home .................
outlinks: [toUrl: file:skin/basic.css anchor: , toUrl: file:skin/screen.css
anchor: , toUrl: file:skin/print.css anchor: , toUrl: file:skin/profile.css
anchor: , toUrl: file:skin/getBlank.js anchor: , toUrl: file:skin/getMenu.js
anchor: , toUrl: file:skin/fontsize.js anchor: , toUrl: file:images/favicon.ico
anchor: , toUrl: http://www.apache.org/ anchor: Apache, toUrl:
http://nutch.apache.org anchor: Nutch, toUrl: http://nutch.apache.org anchor:
Home, toUrl: file:skin/breadcrumbs.js anchor: , toUrl: http://www.apache.org/
anchor: , toUrl: file:images/feather-small.gif anchor: , toUrl:
http://nutch.apache.org/ anchor: , toUrl: file:images/nutch_logo_tm.gif anchor:
, toUrl: file:index.html anchor: Main, toUrl: file:wiki.html anchor: Wiki,
toUrl: http://issues.apache.org/jira/browse/NUTCH anchor: Jira, toUrl:
file:index.html anchor: News, toUrl: file:credits.html anchor: Credits, toUrl:
http://www.apache.org/foundation/thanks.html anchor: Thanks, toUrl:
http://www.cafepress.com/nutch/ anchor: Buy Stuff, toUrl:
http://www.apache.org/foundation/sponsorship.html anchor: Sponsorship, toUrl:
http://www.apache.org/licenses/ anchor: License, toUrl:
http://www.apache.org/security/ anchor: Security, toUrl: file:faq.html anchor:
FAQ, toUrl: file:wiki.html anchor: Wiki, toUrl: file:tutorial.html anchor:
Tutorial, toUrl: file:bot.html anchor: Robot, toUrl:
file:apidocs-2.1/index.html anchor: API Docs (2.1), toUrl:
file:apidocs-1.6/index.html anchor: API Docs (1.6), toUrl:
https://builds.apache.org/job/Nutch-trunk/javadoc/ anchor: API Docs (trunk
nightly), toUrl: https://builds.apache.org/job/Nutch-nutchgora/javadoc/ anchor:
API Docs (2.x nightly), toUrl: file:downloads.html anchor: Download, toUrl:
file:nightly.html anchor: Nightly builds, toUrl: file:sonar.html anchor: Sonar
Analysis, toUrl: file:mailing_lists.html anchor: Mailing Lists, toUrl:
file:issue_tracking.html anchor: Issue Tracking, toUrl:
file:version_control.html anchor: Version Control, toUrl:
file:old_downloads.html anchor: Older Downloads, toUrl:
http://lucene.apache.org/java/ anchor: Lucene, toUrl: http://hadoop.apache.org/
anchor: Hadoop, toUrl: http://lucene.apache.org/solr/ anchor: Solr, toUrl:
http://tika.apache.org/ anchor: Tika, toUrl: http://gora.apache.org anchor:
Gora, toUrl: file:skin/images/rc-b-l-15-1body-2menu-3menu.png anchor: , toUrl:
file:about.pdf anchor: PDF, toUrl: file:skin/images/pdfdoc.gif anchor: , toUrl:
file:about.html#Overview anchor: Overview, toUrl:
http://lucene.apache.org/java/ anchor: Apache Lucene, toUrl:
http://lucene.apache.org/solr/ anchor: Apache Solr, toUrl:
http://tika.apache.org/ anchor: Apache Tika, toUrl: http://hadoop.apache.org/
anchor: Hadoop cluster, toUrl: http://wiki.apache.org/nutch/ anchor: Nutch
wiki., toUrl: http://www.apache.org/licenses/ anchor: The Apache Software
Foundation. Apache Nutch, Nutch, Apache, the Apache feather logo, and the
Apache Nutch project logo are trademarks of The Apache Software
Foundation.]{noformat}
After applying the patch:
{noformat}bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser about.html
title: About Apache Nutch
text: About Apache Nutch Apache > Nutch > Home .................
outlinks: []{noformat}
Correct me if I am wrong: this patch should only remove links longer than 3000
characters. None of the outlinks above is anywhere near that long, so the patch
should not have removed them.
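To make the expectation concrete, here is a minimal sketch of the behaviour I would expect from the limit. It is only an illustration and not the code from either patch: the class and method names (OutlinkLengthFilter, filterByLength) are made up for this example, and the 3000 default is taken from the issue description.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLengthFilter {

    // Assumed default, taken from the issue description (3000 characters).
    static final int DEFAULT_MAX_LENGTH = 3000;

    /**
     * Returns the targets whose length does not exceed maxLength.
     * A negative maxLength is treated here as "no limit" (an assumption
     * made for this sketch, not something the patches are known to do).
     */
    static List<String> filterByLength(List<String> targets, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String target : targets) {
            if (maxLength < 0 || target.length() <= maxLength) {
                kept.add(target);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build one target that is longer than the limit.
        StringBuilder tooLong = new StringBuilder("http://example.com/?q=");
        while (tooLong.length() <= DEFAULT_MAX_LENGTH) {
            tooLong.append('a');
        }

        List<String> targets = new ArrayList<>();
        targets.add("http://www.apache.org/");
        targets.add("file:index.html");
        targets.add(tooLong.toString());

        // Only the over-long target should be dropped.
        System.out.println(filterByLength(targets, DEFAULT_MAX_LENGTH).size());
    }
}
```

With a check like this, all the outlinks from about.html listed above would be kept, since every one of them is far shorter than 3000 characters. Getting an empty list back suggests the comparison in the v2 patch is inverted or the limit is being read as a much smaller value, though I have not confirmed which.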
> Impose a limit on the length of outlink target urls
> ---------------------------------------------------
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1314.patch, NUTCH-1314-trunk.patch,
> NUTCH-1314-v2.patch
>
>
> In the past we have encountered situations where crawling specific broken
> sites resulted in ridiculously long urls that caused the stalling of tasks.
> The regex plugins (normalizing/filtering) processed single urls for hours, if
> not indefinitely hanging.
> My suggestion is to limit the outlink url target length as soon as possible. It
> is a configurable limit, the default is 3000. This should be reasonably long
> enough for most uses. But sufficiently strict enough to make sure regex
> plugins do not choke on urls that are too long. Please see attached patch for
> the Nutchgora implementation.
> I'd like to hear what you think about this.