[jira] Commented: (NUTCH-944) Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements
[ https://issues.apache.org/jira/browse/NUTCH-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999373#comment-12999373 ]

Jean-Francois Gingras commented on NUTCH-944:
---------------------------------------------

We are currently moving to Nutch 1.2, and I will provide a patch for it. I also changed the code to use string.split('') as suggested.
I will try to make time to provide a patch for 2.0, but I have not been able to get 2.0 to compile yet.

Increase the number of elements to look for URLs and add the ability to specify multiple attributes by elements
---------------------------------------------------------------------------------------------------------------

                Key: NUTCH-944
                URL: https://issues.apache.org/jira/browse/NUTCH-944
            Project: Nutch
         Issue Type: Improvement
         Components: parser
   Affects Versions: 1.3
        Environment: GNU/Linux Fedora 12
           Reporter: Jean-Francois Gingras
           Priority: Minor
            Fix For: 1.3
        Attachments: DOMContentUtils.java.path-1.0, DOMContentUtils.java.path-1.3

Here is a patch for DOMContentUtils.java that increases the number of elements in which to look for URLs.
It also adds the ability to specify multiple attributes per element, for example:

    linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
    linkParams.put("object", new LinkParams("object", "classid,codebase,data,usemap", 0));
    linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5

I have a patch for release-1.0 and for branch-1.3.
I would love to hear your comments about this.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
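The multi-attribute idea described in the NUTCH-944 report can be sketched roughly as follows. This is only an illustration, not the attached patch: the class MultiAttrLinkParams, its nested LinkParams, and the helper attributesFor are hypothetical names, and the comma delimiter is an assumption inferred from the examples in the report (the split argument quoted in the comment is not legible).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps an HTML element name to the attributes that may hold URLs.
final class MultiAttrLinkParams {

    // Stand-in for a link-parameter holder; not the real Nutch class.
    static final class LinkParams {
        final String elName;
        final List<String> attrNames;
        final int childLen;

        LinkParams(String elName, String attrNames, int childLen) {
            this.elName = elName;
            // Assumption: attribute names arrive as one comma-separated string.
            this.attrNames = Arrays.asList(attrNames.split(","));
            this.childLen = childLen;
        }
    }

    // Builds the three example entries quoted in the report.
    static Map<String, LinkParams> exampleParams() {
        Map<String, LinkParams> linkParams = new HashMap<>();
        linkParams.put("frame", new LinkParams("frame", "longdesc,src", 0));
        linkParams.put("object", new LinkParams("object", "classid,codebase,data,usemap", 0));
        linkParams.put("video", new LinkParams("video", "poster,src", 0)); // HTML 5
        return linkParams;
    }

    // Returns the URL-bearing attributes for an element, or an empty list if unknown.
    static List<String> attributesFor(String element) {
        LinkParams params = exampleParams().get(element);
        return params == null ? Collections.<String>emptyList() : params.attrNames;
    }

    public static void main(String[] args) {
        for (String element : new String[] {"frame", "object", "video", "p"}) {
            System.out.println(element + " -> " + attributesFor(element));
        }
    }
}
```

A parser walking the DOM would then check each listed attribute of a matching element instead of a single one.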
Nutch Parser annoyingly faulty
Hi Nutch Team,

before I permanently reject Nutch from all my sites, I had better tell you why: your URL parser is extremely faulty and creates a lot of trouble.

Here is an example. If you have a link on a page, say:

http://www.somesite/somepage/

and the link in HTML looks like:

<a href=".">This Page</a>

the parser should identify that the . (dot) refers to this URL:

http://www.somesite/somepage/

and not to:

http://www.somesite/somepage/.

Every single browser does it correctly, why not Nutch?

Why is this important? Many new sites don't use the traditional mapping of directories from the URL model anymore, but instead have controllers, actions, parameters etc. encoded in the URL. These are split by a separator, which often is / (slash), so a URL with a trailing dot requests a different resource than one without the dot. Ignoring the dot in the backend to cope with Nutch's faulty parser would create at least 2 URLs sending the same content, which in turn might affect your Google ranking.

Also, Nutch parses compressed Javascript files, which are all written in one long line, then somehow takes part of the code and adds it to the URL, creating a huge array of 404s on the server side. For example, you have a URL to a Javascript file like this:

http://www.somesite/javascript/foo.js

Nutch parses this and then accesses random (?) new URLs which look like:

http://www.somesite/javascript/someFunction();

etc etc.

Please, please, please fix Nutch!

Thanks,
Juergen
--
Shakodo - The road to profitable photography: http://www.shakodo.com/
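The resolution behaviour asked for above can be checked with a small sketch using the standard java.net.URI class. This is not Nutch code; the class name RelativeLinkDemo and the helper resolve are made up for illustration, and it only shows what standard relative-reference resolution yields for a "." link.

```java
import java.net.URI;

// Hypothetical demo: resolves a relative href against the URL of the page it appears on.
final class RelativeLinkDemo {

    // Resolves href against base using java.net.URI, which also removes "." segments.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.somesite/somepage/";
        // A "." link refers to the current directory, so no trailing dot should remain.
        System.out.println(resolve(base, "."));
    }
}
```

With the base URL from the example, the resolved link is the page URL itself, without a trailing dot, which matches what browsers do.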
Build failed in Hudson: Nutch-trunk #1409
See https://hudson.apache.org/hudson/job/Nutch-trunk/1409/

--
[...truncated 1008 lines...]
A    src/plugin/subcollection/src/java/org/apache/nutch/collection
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A    src/plugin/subcollection/README.txt
A    src/plugin/subcollection/plugin.xml
A    src/plugin/subcollection/build.xml
A    src/plugin/index-more
A    src/plugin/index-more/ivy.xml
A    src/plugin/index-more/src
A    src/plugin/index-more/src/test
A    src/plugin/index-more/src/test/org
A    src/plugin/index-more/src/test/org/apache
A    src/plugin/index-more/src/test/org/apache/nutch
A    src/plugin/index-more/src/test/org/apache/nutch/indexer
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A    src/plugin/index-more/src/java
A    src/plugin/index-more/src/java/org
A    src/plugin/index-more/src/java/org/apache
A    src/plugin/index-more/src/java/org/apache/nutch
A    src/plugin/index-more/src/java/org/apache/nutch/indexer
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A    src/plugin/index-more/plugin.xml
A    src/plugin/index-more/build.xml
AU   src/plugin/plugin.dtd
A    src/plugin/parse-ext
A    src/plugin/parse-ext/ivy.xml
A    src/plugin/parse-ext/src
A    src/plugin/parse-ext/src/test
A    src/plugin/parse-ext/src/test/org
A    src/plugin/parse-ext/src/test/org/apache
A    src/plugin/parse-ext/src/test/org/apache/nutch
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A    src/plugin/parse-ext/src/java
A    src/plugin/parse-ext/src/java/org
A    src/plugin/parse-ext/src/java/org/apache
A    src/plugin/parse-ext/src/java/org/apache/nutch
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A    src/plugin/parse-ext/plugin.xml
A    src/plugin/parse-ext/build.xml
A    src/plugin/parse-ext/command
A    src/plugin/urlnormalizer-pass
A    src/plugin/urlnormalizer-pass/ivy.xml
A    src/plugin/urlnormalizer-pass/src
A    src/plugin/urlnormalizer-pass/src/test
A    src/plugin/urlnormalizer-pass/src/test/org
A    src/plugin/urlnormalizer-pass/src/test/org/apache
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A    src/plugin/urlnormalizer-pass/src/java
A    src/plugin/urlnormalizer-pass/src/java/org
A    src/plugin/urlnormalizer-pass/src/java/org/apache
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU   src/plugin/urlnormalizer-pass/plugin.xml
AU   src/plugin/urlnormalizer-pass/build.xml
A    src/plugin/parse-html
A    src/plugin/parse-html/ivy.xml
A    src/plugin/parse-html/lib
A    src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A    src/plugin/parse-html/src
A    src/plugin/parse-html/src/test
A    src/plugin/parse-html/src/test/org
A    src/plugin/parse-html/src/test/org/apache
A    src/plugin/parse-html/src/test/org/apache/nutch
A    src/plugin/parse-html/src/test/org/apache/nutch/parse
A