[jira] Commented: (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements

2011-02-25 Thread Jean-Francois Gingras (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999373#comment-12999373
 ] 

Jean-Francois Gingras commented on NUTCH-944:
-

We are currently moving to Nutch 1.2, so I will provide a patch for it. I also 
changed the code to use string.split('') as suggested.

I will try to make time to provide a patch for 2.0, but I have not been able 
to get 2.0 to compile yet.

 Increase the number of elements to look for URLs and add the ability to 
 specify multiple attributes by elements
 ---

 Key: NUTCH-944
 URL: https://issues.apache.org/jira/browse/NUTCH-944
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
 Environment: GNU/Linux Fedora 12
Reporter: Jean-Francois Gingras
Priority: Minor
 Fix For: 1.3

 Attachments: DOMContentUtils.java.path-1.0, 
 DOMContentUtils.java.path-1.3


 Here is a patch for DOMContentUtils.java that increases the number of elements 
 to look for URLs. It also adds the ability to specify multiple attributes per 
 element, for example:
 linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
 linkParams.put("object", new LinkParams("object", 
 "classid,codebase,data,usemap", 0));
 linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5
 I have a patch for release-1.0 and branch-1.3.
 I would love to hear your comments about this.
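
 The multi-attribute idea described above could look roughly like the sketch 
 below. This is not the attached patch: the class LinkParamsSketch, its nested 
 LinkParams, the field names, and the comma delimiter passed to split() are 
 all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LinkParamsSketch {

    // Hypothetical stand-in for the patched LinkParams class.
    static final class LinkParams {
        final String elName;
        final List<String> attrNames;
        final int childLen;

        LinkParams(String elName, String attrNames, int childLen) {
            this.elName = elName;
            // Assumption: attribute names arrive as one comma-separated string.
            this.attrNames = Arrays.asList(attrNames.split(","));
            this.childLen = childLen;
        }
    }

    // Builds the three example entries quoted in the issue description.
    static Map<String, LinkParams> defaults() {
        Map<String, LinkParams> linkParams = new HashMap<>();
        linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
        linkParams.put("object",
                new LinkParams("object", "classid,codebase,data,usemap", 0));
        linkParams.put("video", new LinkParams("video", "poster,src", 0));
        return linkParams;
    }

    public static void main(String[] args) {
        // Print each element name with the attributes that may carry a URL.
        for (LinkParams p : defaults().values()) {
            System.out.println(p.elName + " -> " + p.attrNames);
        }
    }
}
```

 A parser using such a table would check every listed attribute of a matching 
 element instead of a single one.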

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Nutch Parser annoyingly faulty

2011-02-25 Thread Juergen Specht

Hi Nutch Team,

Before I permanently block Nutch from all my sites, I had better tell
you why: your URL parser is extremely faulty and creates a lot of
trouble.

Here is an example. Suppose you have a page at:

http://www.somesite/somepage/

and the link in HTML looks like:

<a href=".">This Page</a>

the parser should identify that the . (dot) refers
to this URL:

http://www.somesite/somepage/

and not to:

http://www.somesite/somepage/.

Every single browser does it correctly, why not Nutch?
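
The expected behaviour can be checked with the standard library alone. The
sketch below is only an illustration: the class RelativeLinkDemo and both
method names are made up here, and concatenate() is a hypothetical naive
approach that reproduces the reported URL, not code taken from Nutch.

```java
import java.net.URI;

final class RelativeLinkDemo {

    /** Resolves a link target found on a page against that page's URL. */
    static String resolve(String pageUrl, String href) {
        // URI.resolve applies the relative-reference rules and normalizes
        // the result, which removes "." path segments.
        return URI.create(pageUrl).resolve(href).toString();
    }

    /** Hypothetical naive approach: append the link text to the page URL. */
    static String concatenate(String pageUrl, String href) {
        return pageUrl + href;
    }

    public static void main(String[] args) {
        String page = "http://www.somesite/somepage/";
        System.out.println("resolved:     " + resolve(page, "."));
        System.out.println("concatenated: " + concatenate(page, "."));
    }
}
```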

Why is this important? Many new sites don't use the traditional
mapping of directories from the URL model anymore, but instead
have controllers, actions, parameters etc. encoded in the URL.

They get split by a separator, which often is / (slash), so a URL
with a trailing dot requests a different resource than the same URL
without the dot. Ignoring the dot in the backend to cope with
Nutch's faulty parser would create at least two URLs serving the
same content, which in turn might affect your Google ranking.

Also, Nutch parses compressed JavaScript files, which are all
written on one long line, then somehow takes part of the code and
adds it to the URL, creating a huge number of 404s on the server
side.

For example, you have a URL to a JavaScript file like this:

 http://www.somesite/javascript/foo.js

Nutch parses this and then accesses random (?) new URLs which look like:

http://www.somesite/javascript/someFunction();

etc etc.
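
One way such URLs could arise, sketched below, is a code fragment being
treated as a relative link and resolved against the script's URL. Whether
Nutch does exactly this is an assumption. The class ScriptLinkDemo and the
looksLikeCode() check are hypothetical, and the check is a rough heuristic
that would also reject the rare legitimate URL containing these characters.

```java
import java.net.URI;
import java.util.regex.Pattern;

final class ScriptLinkDemo {

    // Characters that are typical of code and unusual in link targets.
    private static final Pattern CODE_CHARS = Pattern.compile("[(){};\\s]");

    /** Resolves a candidate string against the URL of the script file. */
    static String resolve(String scriptUrl, String candidate) {
        return URI.create(scriptUrl).resolve(candidate).toString();
    }

    /** Hypothetical filter: true if the candidate looks like code. */
    static boolean looksLikeCode(String candidate) {
        return CODE_CHARS.matcher(candidate).find();
    }

    public static void main(String[] args) {
        String script = "http://www.somesite/javascript/foo.js";
        String[] candidates = {"someFunction();", "lib/bar.js"};
        for (String candidate : candidates) {
            if (looksLikeCode(candidate)) {
                // Show the bogus URL that blind resolution would produce.
                System.out.println("rejected: " + resolve(script, candidate));
            } else {
                System.out.println("accepted: " + resolve(script, candidate));
            }
        }
    }
}
```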

Please, please, please fix Nutch!

Thanks,

Juergen
--
Shakodo - The road to profitable photography: http://www.shakodo.com/


Build failed in Hudson: Nutch-trunk #1409

2011-02-25 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1409/

--
[...truncated 1008 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A