[Nutch Wiki] Trivial Update of PluginCentral by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The PluginCentral page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=83&rev2=84

  * AboutPlugins - General information on what plugins are and how they work.
  * [[WhichTechnicalConceptsAreBehindTheNutchPluginSystem|Technical Concepts Behind the Nutch Plugin System]]
  * [[WhatsTheProblemWithPluginsAndClass-loading|Problems with Plugins and Class-Loading]]
- * WritingPluginExample - A step-by-step example of how to write a plugin for Nutch-1.3
+ * WritingPluginExample - A step-by-step example of how to write a plugin using the 1.x API.
  * [[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a plugin to add dates]] by Ryan Pfister
  * PluginGotchas - Yep there are some Gotchas you need to consider.
  * TikaPlugin - Comments on the Tika integration and differences with existing parse plugins
[jira] [Resolved] (NUTCH-1593) normalize option missing in SegmentMerger's usage
[ https://issues.apache.org/jira/browse/NUTCH-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1593.
Resolution: Fixed

Committed in trunk in rev. 1498346.

normalize option missing in SegmentMerger's usage
Key: NUTCH-1593
URL: https://issues.apache.org/jira/browse/NUTCH-1593
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.8
Attachments: NUTCH-1593.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1581) CrawlDB csv output to include metadata
[ https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696691#comment-13696691 ]

Markus Jelsma commented on NUTCH-1581:

I'll commit this one unless there are objections. Thanks

CrawlDB csv output to include metadata
Key: NUTCH-1581
URL: https://issues.apache.org/jira/browse/NUTCH-1581
Project: Nutch
Issue Type: Improvement
Components: crawldb
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1581-1.8.patch

Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.
[jira] [Commented] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696705#comment-13696705 ]

Markus Jelsma commented on NUTCH-1327:

Any comments? Thanks

QueryStringNormalizer
Key: NUTCH-1327
URL: https://issues.apache.org/jira/browse/NUTCH-1327
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.9
Attachments: NUTCH-1327-1.8-1.patch

A normalizer for dealing with query strings. Sorting query strings is helpful in preventing duplicates for some (bad) websites.
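The idea in the issue description (sorting query strings so that URLs differing only in parameter order collapse to one form) can be sketched roughly as follows. This is not the code from the attached patch, which is not shown here; the class name `QueryStringSortSketch` and the method `sortQuery` are hypothetical, and the sketch works on plain strings rather than on the Nutch plugin API.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the NUTCH-1327 patch.
final class QueryStringSortSketch {

    /** Returns the URL with its query parameters sorted lexicographically. */
    static String sortQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return url; // no query string to sort
        }
        // Keep a trailing fragment (#...) out of the sort.
        int hash = url.indexOf('#', q);
        String query = hash < 0 ? url.substring(q + 1) : url.substring(q + 1, hash);
        String fragment = hash < 0 ? "" : url.substring(hash);

        // Sort the raw "name=value" pairs; no decoding is attempted here.
        String[] params = query.split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params) + fragment;
    }

    public static void main(String[] args) {
        // Two orderings of the same parameters normalize to the same string.
        System.out.println(sortQuery("http://example.com/page?b=2&a=1"));
        System.out.println(sortQuery("http://example.com/page?a=1&b=2"));
    }
}
```

Both calls in `main` produce the same string, which is what lets a crawler treat the two URLs as one entry. A real normalizer would probably also need to decide how to handle empty parameters and percent-encoding, which this sketch leaves alone.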
[jira] [Commented] (NUTCH-1593) normalize option missing in SegmentMerger's usage
[ https://issues.apache.org/jira/browse/NUTCH-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696746#comment-13696746 ]

Hudson commented on NUTCH-1593:

Integrated in Nutch-trunk #2263 (See [https://builds.apache.org/job/Nutch-trunk/2263/])
NUTCH-1593 Normalize option missing in SegmentMerger's usage (Revision 1498346)

Result = SUCCESS
markus : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1498346
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java

normalize option missing in SegmentMerger's usage
Key: NUTCH-1593
URL: https://issues.apache.org/jira/browse/NUTCH-1593
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.8
Attachments: NUTCH-1593.patch
Jenkins build is back to normal : Nutch-trunk #2263
See https://builds.apache.org/job/Nutch-trunk/2263/changes
Adding nutch stage
Hi,

I'd like to add a new stage called updatescore after updatedb to Nutch 2.1. I tried two ways of doing this:

1) public class ScoreUpdaterJob extends NutchTool implements Tool;

Nutch requires me to define the InputFormat, OutputFormat etc. to perform MapReduce calculations. I don't want to perform MapReduce but to call a Giraph job to run on Hadoop. When it has finished, Nutch can go on its way.

2) public class ScoreUpdaterJob implements Tool; or public class ScoreUpdaterJob;

Then I can't use setJarClass of NutchTool, so the Hadoop job fails:

Caused by: java.lang.ClassNotFoundException: org.apache.giraph.examples.LinkRank.LinkRankComputation

How can I fix this? What's the best way to add a Giraph job as a Nutch stage?

Thanks,
[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696798#comment-13696798 ]

lufeng commented on NUTCH-1594:

Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis.

count variable is never changed in ParseUtil class
Key: NUTCH-1594
URL: https://issues.apache.org/jira/browse/NUTCH-1594
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.2
Reporter: lufeng
Assignee: lufeng
Priority: Minor
Fix For: 2.3
Attachments: NUTCH-1594.patch

In the ParseUtil class the count variable is never changed. The code looks like this:

for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)

So even if you define the db.max.outlinks.per.page parameter, it will not take effect.
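To make the reported problem concrete, here is a minimal sketch of a loop with the same header in which `count` is actually incremented. Only the loop header comes from the issue text; the loop body, the class name `OutlinkLimitSketch`, the method `limitOutlinks`, and the use of plain strings instead of Nutch outlink objects are assumptions made for illustration, and this is not the committed patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the ParseUtil code from Nutch.
final class OutlinkLimitSketch {

    /** Keeps at most maxOutlinks non-empty links, in their original order. */
    static List<String> limitOutlinks(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            String link = outlinks[i];
            if (link == null || link.isEmpty()) {
                continue; // skipped links do not count towards the limit
            }
            kept.add(link);
            count++; // if this increment is missing, the limit never applies
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] links = {"http://a.example/", "", "http://b.example/", "http://c.example/"};
        System.out.println(limitOutlinks(links, 2));
    }
}
```

If the `count++` line is removed, `count < maxOutlinks` stays true for any positive limit and every link is kept, which matches the behaviour described in the report.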
[jira] [Commented] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696840#comment-13696840 ]

Tejas Patil commented on NUTCH-1327:

Hi Markus,

1. The patch, when applied as is, didn't compile the plugin. I had to add entries to src/plugin/build.xml to get it compiled.
2. Can you kindly add some javadoc comments to the QuerystringURLNormalizer class so that people can quickly get an idea of what this plugin does?

QueryStringNormalizer
Key: NUTCH-1327
URL: https://issues.apache.org/jira/browse/NUTCH-1327
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.9
Attachments: NUTCH-1327-1.8-1.patch

A normalizer for dealing with query strings. Sorting query strings is helpful in preventing duplicates for some (bad) websites.
[jira] [Commented] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696854#comment-13696854 ]

lufeng commented on NUTCH-1327:

Hi Markus,

I tested your patch. Did you forget to add the deploy and test targets to src/plugin/build.xml?

+1

QueryStringNormalizer
Key: NUTCH-1327
URL: https://issues.apache.org/jira/browse/NUTCH-1327
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.9
Attachments: NUTCH-1327-1.8-1.patch

A normalizer for dealing with query strings. Sorting query strings is helpful in preventing duplicates for some (bad) websites.
[jira] [Updated] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1327:
Attachment: NUTCH-1327-1.8-2.patch

Thanks! I always forget something! Here's a new one plus comment!

QueryStringNormalizer
Key: NUTCH-1327
URL: https://issues.apache.org/jira/browse/NUTCH-1327
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.9
Attachments: NUTCH-1327-1.8-1.patch, NUTCH-1327-1.8-2.patch

A normalizer for dealing with query strings. Sorting query strings is helpful in preventing duplicates for some (bad) websites.