[Nutch Wiki] Trivial Update of PluginCentral by LewisJohnMcgibbney

2013-07-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PluginCentral page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=83&rev2=84

   * AboutPlugins - General information on what plugins are and how they work.
   * [[WhichTechnicalConceptsAreBehindTheNutchPluginSystem|Technical Concepts 
Behind the Nutch Plugin System]]
   * [[WhatsTheProblemWithPluginsAndClass-loading|Problems with Plugins and 
Class-Loading]] 
-  * WritingPluginExample - A step-by-step example of how to write a plugin for 
Nutch-1.3 
+  * WritingPluginExample - A step-by-step example of how to write a plugin 
using the 1.x API.
   * 
[[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a 
plugin to add dates]] by Ryan Pfister
   * PluginGotchas - Yep there are some Gotchas you need to consider.
   * TikaPlugin - Comments on the Tika integration and differences with 
existing parse plugins


[jira] [Resolved] (NUTCH-1593) normalize option missing in SegmentMerger's usage

2013-07-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1593.
--

Resolution: Fixed

Committed in trunk in rev. 1498346.

 normalize option missing in SegmentMerger's usage
 -

 Key: NUTCH-1593
 URL: https://issues.apache.org/jira/browse/NUTCH-1593
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.8

 Attachments: NUTCH-1593.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1581) CrawlDB csv output to include metadata

2013-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696691#comment-13696691
 ] 

Markus Jelsma commented on NUTCH-1581:
--

I'll commit this one unless there are objections. Thanks

 CrawlDB csv output to include metadata
 --

 Key: NUTCH-1581
 URL: https://issues.apache.org/jira/browse/NUTCH-1581
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1581-1.8.patch


 Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696705#comment-13696705
 ] 

Markus Jelsma commented on NUTCH-1327:
--

Any comments? Thanks

 QueryStringNormalizer
 -

 Key: NUTCH-1327
 URL: https://issues.apache.org/jira/browse/NUTCH-1327
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1327-1.8-1.patch


 A normalizer for dealing with query strings. Sorting query strings is helpful 
 in preventing duplicates for some (bad) websites.
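The idea in the description can be sketched roughly as follows. This is only an illustration of sorting query parameters so that equivalent URLs collapse to one form; it is not the code from the attached patch, and the class and method names (QuerySortSketch, sortQuery) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch, not the plugin from NUTCH-1327.
class QuerySortSketch {

    /** Returns the URL with its query parameters sorted lexicographically. */
    static String sortQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return url; // no query string, nothing to sort
        }
        // Keep any fragment out of the sort and re-append it unchanged.
        int hash = url.indexOf('#', q);
        String fragment = hash < 0 ? "" : url.substring(hash);
        String query = hash < 0 ? url.substring(q + 1) : url.substring(q + 1, hash);

        String[] params = query.split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params) + fragment;
    }

    public static void main(String[] args) {
        // Both calls print the same normalized URL.
        System.out.println(sortQuery("http://example.com/page?b=2&a=1"));
        System.out.println(sortQuery("http://example.com/page?a=1&b=2"));
    }
}
```

With this kind of normalization, two URLs that differ only in parameter order map to the same string, so a crawler would treat them as one page.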



[jira] [Commented] (NUTCH-1593) normalize option missing in SegmentMerger's usage

2013-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696746#comment-13696746
 ] 

Hudson commented on NUTCH-1593:
---

Integrated in Nutch-trunk #2263 (See 
[https://builds.apache.org/job/Nutch-trunk/2263/])
NUTCH-1593 Normalize option missing in SegmentMerger's usage (Revision 
1498346)

 Result = SUCCESS
markus : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1498346
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java


 normalize option missing in SegmentMerger's usage
 -

 Key: NUTCH-1593
 URL: https://issues.apache.org/jira/browse/NUTCH-1593
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.8

 Attachments: NUTCH-1593.patch






Jenkins build is back to normal : Nutch-trunk #2263

2013-07-01 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2263/changes



Adding nutch stage

2013-07-01 Thread Ahmet Emre Aladağ

Hi,

I'd like to add a new stage called updatescore after updatedb to 
Nutch 2.1.


I tried two ways for this:
1) public class ScoreUpdaterJob extends NutchTool implements Tool;

Nutch requires me to define the InputFormat, OutputFormat etc. to 
perform Map-reduce calculations.


I don't want to perform map-reduce but call a Giraph job to run on 
Hadoop. When it's finished, Nutch can go on its way.


2) public class ScoreUpdaterJob implements Tool;
or public class ScoreUpdaterJob;

Then I can't use NutchTool's setJarClass, so the Hadoop job fails:
Caused by: java.lang.ClassNotFoundException: 
org.apache.giraph.examples.LinkRank.LinkRankComputation


How can I fix this? What's the best way to add a Giraph job as a Nutch 
stage?

Thanks,




[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696798#comment-13696798
 ] 

lufeng commented on NUTCH-1594:
---

Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis.

 count variable is never changed in ParseUtil class
 --

 Key: NUTCH-1594
 URL: https://issues.apache.org/jira/browse/NUTCH-1594
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2
Reporter: lufeng
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1594.patch


 In the ParseUtil class the count variable is never changed. The code looks like this:
 for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) 
 so even if you define the db.max.outlinks.per.page parameter, it will not 
 take effect.
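The loop described above can be sketched as follows. This is a minimal illustration of the bug and its fix, not the actual ParseUtil code or the attached patch; the class name, the method name and the rule for skipping links are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real ParseUtil from Nutch 2.x.
class OutlinkLimitSketch {

    /** Keeps at most maxOutlinks entries, skipping null or empty links. */
    static List<String> limitOutlinks(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            String link = outlinks[i];
            if (link == null || link.isEmpty()) {
                continue; // skipped links do not count toward the limit
            }
            kept.add(link);
            count++; // without this increment the limit never takes effect
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] links = {"http://a.example/", "", "http://b.example/", "http://c.example/"};
        // Prints the first two non-empty links only.
        System.out.println(limitOutlinks(links, 2));
    }
}
```

If the count++ line is removed, the condition count < maxOutlinks stays true for the whole loop and every outlink is kept, which matches the behaviour reported in the issue.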



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696840#comment-13696840
 ] 

Tejas Patil commented on NUTCH-1327:


Hi Markus,

1. When applied as is, the patch didn't compile the plugin. I had to add entries 
to src/plugin/build.xml to get it compiled. 
2. Can you kindly add some javadoc comments to the QuerystringURLNormalizer class 
so that people can quickly get an idea of what this plugin does?

 QueryStringNormalizer
 -

 Key: NUTCH-1327
 URL: https://issues.apache.org/jira/browse/NUTCH-1327
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1327-1.8-1.patch


 A normalizer for dealing with query strings. Sorting query strings is helpful 
 in preventing duplicates for some (bad) websites.



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696854#comment-13696854
 ] 

lufeng commented on NUTCH-1327:
---

Hi Markus, I tested your patch. Did you forget to add the deploy and test targets to 
src/plugin/build.xml?

+1 

 QueryStringNormalizer
 -

 Key: NUTCH-1327
 URL: https://issues.apache.org/jira/browse/NUTCH-1327
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1327-1.8-1.patch


 A normalizer for dealing with query strings. Sorting query strings is helpful 
 in preventing duplicates for some (bad) websites.



[jira] [Updated] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1327:
-

Attachment: NUTCH-1327-1.8-2.patch

Thanks! I always forget something! Here's a new one plus comment!

 QueryStringNormalizer
 -

 Key: NUTCH-1327
 URL: https://issues.apache.org/jira/browse/NUTCH-1327
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1327-1.8-1.patch, NUTCH-1327-1.8-2.patch


 A normalizer for dealing with query strings. Sorting query strings is helpful 
 in preventing duplicates for some (bad) websites.
