Re: Generic LinkRank plugin for Nutch

2013-05-30 Thread Julien Nioche
Hi Ahmet,

You don't need to use the ScoringFilters at all.  The
nutch.scoring.webgraph package can be taken as an example of how to do it. It
works fine as far as I know, but what we wanted with the Giraph-based
replacement was to have less code to maintain and also to have something we
could use in 2.x straight away. If there are performance improvements as
well, all the better for it!

Thanks

Julien


On 29 May 2013 09:00, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote:

 Hi,

 I'm working on a LinkRank implementation with Giraph for both Nutch 1.x and
 2.x.  What I'm planning [1] is to get the outlink data, give it as a
 graph to Giraph, and perform the LinkRank calculation. Then read the results
 and inject them back into Nutch.

 Summary of the task:
 1. Get the outlinkDB, write it as [URL, URL] pairs on HDFS,
 2. Write current (initial) scores from CrawlDB as [URL, Score] on HDFS.
 3. Run Giraph LinkRank.
 4. Read the resulting [URL, NewScore] pairs
 5. Update CrawlDB with the new scores.

 So the plugin will be like a proxy.
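
A minimal in-memory sketch of the five steps above, for illustration only. It does not use Giraph, HDFS, or any Nutch API: the class `LinkRankSketch`, the `Edge` type, and the `rank` method are hypothetical names, and the PageRank-style update with a damping factor is an assumption about what the LinkRank step computes, not a statement of what the Giraph job does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the export / rank / re-import steps. */
final class LinkRankSketch {

  /** One [URL, URL] outlink pair, as written out in step 1. */
  static final class Edge {
    final String from;
    final String to;

    Edge(String from, String to) {
      this.from = from;
      this.to = to;
    }
  }

  /**
   * Steps 3-4: iterate a PageRank-style update (an assumption, see lead-in)
   * and return [URL, NewScore] pairs. URLs missing from 'initial' start at 1.0.
   */
  static Map<String, Double> rank(List<Edge> edges, Map<String, Double> initial,
                                  double damping, int iterations) {
    // Count the outlinks of every source URL.
    Map<String, Integer> outDegree = new HashMap<>();
    for (Edge e : edges) {
      outDegree.merge(e.from, 1, Integer::sum);
    }

    // Step 2: start from the current scores, defaulting unseen URLs to 1.0.
    Map<String, Double> scores = new HashMap<>(initial);
    for (Edge e : edges) {
      scores.putIfAbsent(e.from, 1.0);
      scores.putIfAbsent(e.to, 1.0);
    }

    for (int i = 0; i < iterations; i++) {
      // Each URL passes an equal share of its score along every outlink.
      Map<String, Double> inlinkSum = new HashMap<>();
      for (Edge e : edges) {
        double share = scores.get(e.from) / outDegree.get(e.from);
        inlinkSum.merge(e.to, share, Double::sum);
      }
      Map<String, Double> next = new HashMap<>();
      for (String url : scores.keySet()) {
        next.put(url, (1 - damping) + damping * inlinkSum.getOrDefault(url, 0.0));
      }
      scores = next;
    }
    return scores;
  }

  public static void main(String[] args) {
    // Step 1: the outlink graph as [URL, URL] pairs (made-up example URLs).
    List<Edge> edges = new ArrayList<>();
    edges.add(new Edge("http://a.example/", "http://b.example/"));
    edges.add(new Edge("http://b.example/", "http://a.example/"));
    edges.add(new Edge("http://a.example/", "http://c.example/"));

    // Step 2: current scores as [URL, Score] pairs.
    Map<String, Double> initial = new HashMap<>();
    initial.put("http://a.example/", 1.0);

    // Step 5 would write these back to the CrawlDB; here they are only printed.
    rank(edges, initial, 0.85, 10)
        .forEach((url, score) -> System.out.println(url + "\t" + score));
  }
}
```

In a real plugin the two maps would be replaced by files on HDFS and the loop by the Giraph job; the sketch only shows the shape of the data that crosses the boundary.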

 As far as I can see, the ScoringFilter mechanism in 1.x requires
 implementing methods that handle URLs one by one.

 Ex:
   public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
       Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust,
       int allCount) throws ScoringFilterException;


 But I'd like to write/read the whole db. Now I think that instead of a
 ScoringFilter, I should write a generic plugin to achieve this. Should I
 extend Pluggable? Could you suggest what the best way to achieve this
 would be? I'm starting with 1.x but will move on to 2.x, so suggestions
 for both are welcome.

 Thanks,


 [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1545:
--

Fix Version/s: (was: 2.3)
   2.2

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of a segment is replaced by batchId in 2.x.
 I'm currently getting rid of segment references in 2.x.
 This issue was flagged up and is separate from NUTCH-1532, which I am working on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670376#comment-13670376
 ] 

lufeng commented on NUTCH-1545:
---

Committed for Nutch 2.2 in revision 1487875 by Feng. Thanks, Tejas and Lewis.

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of a segment is replaced by batchId in 2.x.
 I'm currently getting rid of segment references in 2.x.
 This issue was flagged up and is separate from NUTCH-1532, which I am working on.



[jira] [Resolved] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1545.
---

Resolution: Fixed

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of a segment is replaced by batchId in 2.x.
 I'm currently getting rid of segment references in 2.x.
 This issue was flagged up and is separate from NUTCH-1532, which I am working on.



[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670399#comment-13670399
 ] 

Hudson commented on NUTCH-1545:
---

Integrated in Nutch-nutchgora #625 (See 
[https://builds.apache.org/job/Nutch-nutchgora/625/])
NUTCH-1545 capture batchId and remove references to segments in 2.x crawl 
script. (Revision 1487875)

 Result = SUCCESS
fenglu : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1487875
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/bin/crawl
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java


 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch


 The concept of a segment is replaced by batchId in 2.x.
 I'm currently getting rid of segment references in 2.x.
 This issue was flagged up and is separate from NUTCH-1532, which I am working on.



[jira] [Created] (NUTCH-1576) Need to keep hotStore.flush() exception catching

2013-05-30 Thread James Sullivan (JIRA)
James Sullivan created NUTCH-1576:
-

 Summary: Need to keep hotStore.flush() exception catching
 Key: NUTCH-1576
 URL: https://issues.apache.org/jira/browse/NUTCH-1576
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor


Exception handling for hostStore.flush() is still needed for those who have to use
gora-core 0.2.1; otherwise Nutch 2.x will not compile.

<!-- Uncomment this to use SQL as Gora backend. It should be noted that the
gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3.
Users should downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->


Index: src/java/org/apache/nutch/host/HostDb.java
===================================================================
--- java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java   (revision 1487824)
+++ java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java   (working copy)
@@ -87,7 +87,11 @@
             CacheHost removeFromCacheHost = notification.getValue();
             if (removeFromCacheHost != NULL_HOST) {
               if (removeFromCacheHost.timestamp < lastFlush.get()) {
-                hostStore.flush();
+                try {
+                  hostStore.flush();
+                } catch (IOException e) {
+                  throw new RuntimeException(e);
+                }
                 lastFlush.set(System.currentTimeMillis());
               }
             }
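
The pattern in the patch, wrapping a checked IOException in a RuntimeException inside a callback that cannot declare checked exceptions, can be sketched in isolation as follows. This is a sketch under stated assumptions, not the Nutch or Gora code: `FlushWrapSketch`, `Store`, and `flushQuietly` are hypothetical names, and `Store` merely stands in for a data store whose flush() declares IOException.

```java
import java.io.IOException;

/** Hypothetical illustration of wrapping a checked exception in a callback. */
final class FlushWrapSketch {

  /** Stand-in for a store whose flush() declares a checked exception. */
  interface Store {
    void flush() throws IOException;
  }

  /**
   * Returns a Runnable that flushes the store. Runnable.run() cannot declare
   * IOException, so the checked exception is rethrown as an unchecked one
   * with the original kept as its cause.
   */
  static Runnable flushQuietly(Store store) {
    return () -> {
      try {
        store.flush();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    };
  }

  public static void main(String[] args) {
    // A store that always fails, to show what the caller observes.
    Store failing = () -> {
      throw new IOException("simulated flush failure");
    };
    try {
      flushQuietly(failing).run();
    } catch (RuntimeException e) {
      System.out.println("wrapped cause: " + e.getCause());
    }
  }
}
```

Note that Java rejects a catch clause for a checked exception that the try body cannot throw, so a block like this only compiles when the flush() being called actually declares IOException.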



[jira] [Updated] (NUTCH-1576) Need to keep hotStore.flush() exception catching

2013-05-30 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1576:
--

Attachment: patch.txt

 Need to keep hotStore.flush() exception catching
 

 Key: NUTCH-1576
 URL: https://issues.apache.org/jira/browse/NUTCH-1576
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
 Attachments: patch.txt


 Exception handling for hostStore.flush() is still needed for those who have to use
 gora-core 0.2.1; otherwise Nutch 2.x will not compile.
 <!-- Uncomment this to use SQL as Gora backend. It should be noted that the
 gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3.
 Users should downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
 Index: src/java/org/apache/nutch/host/HostDb.java
 ===================================================================
 --- java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java   (revision 1487824)
 +++ java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java   (working copy)
 @@ -87,7 +87,11 @@
              CacheHost removeFromCacheHost = notification.getValue();
              if (removeFromCacheHost != NULL_HOST) {
                if (removeFromCacheHost.timestamp < lastFlush.get()) {
 -                hostStore.flush();
 +                try {
 +                  hostStore.flush();
 +                } catch (IOException e) {
 +                  throw new RuntimeException(e);
 +                }
                  lastFlush.set(System.currentTimeMillis());
                }
              }
