Re: Generic LinkRank plugin for Nutch
Hi Ahmet,

You don't need to use the ScoringFilters at all. The nutch.scoring.webgraph package can be taken as an example of how to do it. It works fine as far as I know, but what we wanted with the Giraph-based replacement was to have less code to maintain and also to have something we could use in 2.x straight away. If there are performance improvements as well, all the better!

Thanks

Julien

On 29 May 2013 09:00, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote:

> Hi,
>
> I'm working on a LinkRank implementation with Giraph for both Nutch 1.x and 2.x. What I'm planning [1] is to get the outlink data, give it to Giraph as a graph, and perform the LinkRank calculation there. Then I read the results and inject them back into Nutch.
>
> Summary of the task:
>
> 1. Get the outlinkDB and write it as [URL, URL] pairs on HDFS.
> 2. Write the current (initial) scores from CrawlDB as [URL, Score] on HDFS.
> 3. Run Giraph LinkRank.
> 4. Read the resulting [URL, NewScore] pairs.
> 5. Update CrawlDB with the new scores.
>
> So the plugin will act like a proxy. As far as I can see, the ScoringFilter mechanism in 1.x requires implementing methods that handle URLs one by one, e.g.:
>
>   public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException;
>
> But I'd like to write/read the whole db. I now think that instead of a ScoringFilter, I should write a generic plugin to achieve this. Should I extend Pluggable? Could you suggest the best way to achieve this? I'm starting with 1.x but will move on to 2.x, so suggestions for both are welcome.
>
> Thanks,
>
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
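As a rough illustration of the five steps in the quoted plan, here is a minimal in-memory sketch in Java. It is not Nutch or Giraph code: the class name LinkRankSketch, the method rank, and the damping and iteration parameters are all hypothetical, and the update rule is a plain, unnormalized PageRank-style iteration standing in for whatever the Giraph LinkRank job actually computes. The [URL, URL] pairs and [URL, Score] map play the role of the files on HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for steps 1-5; not Nutch or Giraph code. */
class LinkRankSketch {

    /**
     * edges:   [fromUrl, toUrl] pairs (step 1).
     * initial: [url, score] entries (step 2).
     * Returns [url, newScore] entries (step 4) after a fixed number of iterations (step 3).
     */
    static Map<String, Double> rank(List<String[]> edges, Map<String, Double> initial,
                                    int iterations, double damping) {
        // Group outlinks by source URL; URLs without an initial score start at 0.
        Map<String, List<String>> outlinks = new HashMap<>();
        Map<String, Double> scores = new HashMap<>(initial);
        for (String[] edge : edges) {
            outlinks.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
            scores.putIfAbsent(edge[0], 0.0);
            scores.putIfAbsent(edge[1], 0.0);
        }

        for (int i = 0; i < iterations; i++) {
            // Every URL starts the round with the base term (1 - damping).
            Map<String, Double> next = new HashMap<>();
            for (String url : scores.keySet()) {
                next.put(url, 1.0 - damping);
            }
            // Each source splits its damped score evenly among its outlinks.
            // URLs with no outlinks simply pass nothing on in this sketch.
            for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
                double share = damping * scores.get(entry.getKey()) / entry.getValue().size();
                for (String target : entry.getValue()) {
                    next.merge(target, share, Double::sum);
                }
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[] {"http://a.example/", "http://b.example/"},
                new String[] {"http://b.example/", "http://a.example/"});
        Map<String, Double> initial = new HashMap<>();
        initial.put("http://a.example/", 1.0);
        initial.put("http://b.example/", 1.0);
        // Step 5 (writing the scores back to CrawlDB) is only printed here.
        System.out.println(rank(edges, initial, 10, 0.85));
    }
}
```

The point of the sketch is the data flow: the whole link graph and score table go in at once and a whole score table comes out, which is why a per-URL ScoringFilter callback is an awkward fit for this job.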
[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1545:
--------------------------
    Fix Version/s: (was: 2.3)
                   2.2

> capture batchId and remove references to segments in 2.x crawl script.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-1545
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1545
>             Project: Nutch
>          Issue Type: Task
>   Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Assignee: lufeng
>            Priority: Minor
>             Fix For: 2.2
>         Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
> The concept of segment is replaced by batchId in 2.x.
> I'm currently getting rid of segments references in 2.x.
> This issue was flagged up and separate from NUTCH-1532, which I am working on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670376#comment-13670376 ]

lufeng commented on NUTCH-1545:
-------------------------------

Committed for Nutch 2.2, revision 1487875, by Feng.
Thanks Tejas and Lewis.

> capture batchId and remove references to segments in 2.x crawl script.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-1545
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1545
>             Project: Nutch
>          Issue Type: Task
>   Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Assignee: lufeng
>            Priority: Minor
>             Fix For: 2.3
>         Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
> The concept of segment is replaced by batchId in 2.x.
> I'm currently getting rid of segments references in 2.x.
> This issue was flagged up and separate from NUTCH-1532, which I am working on.
[jira] [Resolved] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng resolved NUTCH-1545.
---------------------------
    Resolution: Fixed

> capture batchId and remove references to segments in 2.x crawl script.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-1545
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1545
>             Project: Nutch
>          Issue Type: Task
>   Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Assignee: lufeng
>            Priority: Minor
>             Fix For: 2.2
>         Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
> The concept of segment is replaced by batchId in 2.x.
> I'm currently getting rid of segments references in 2.x.
> This issue was flagged up and separate from NUTCH-1532, which I am working on.
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670399#comment-13670399 ]

Hudson commented on NUTCH-1545:
-------------------------------

Integrated in Nutch-nutchgora #625 (See [https://builds.apache.org/job/Nutch-nutchgora/625/])
    NUTCH-1545 capture batchId and remove references to segments in 2.x crawl script. (Revision 1487875)

     Result = SUCCESS
fenglu : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1487875
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/bin/crawl
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java

> capture batchId and remove references to segments in 2.x crawl script.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-1545
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1545
>             Project: Nutch
>          Issue Type: Task
>   Affects Versions: 2.1
>            Reporter: Lewis John McGibbney
>            Assignee: lufeng
>            Priority: Minor
>             Fix For: 2.2
>         Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
> The concept of segment is replaced by batchId in 2.x.
> I'm currently getting rid of segments references in 2.x.
> This issue was flagged up and separate from NUTCH-1532, which I am working on.
[jira] [Created] (NUTCH-1576) Need to keep hotStore.flush() exception catching
James Sullivan created NUTCH-1576:
-------------------------------------

             Summary: Need to keep hotStore.flush() exception catching
                 Key: NUTCH-1576
                 URL: https://issues.apache.org/jira/browse/NUTCH-1576
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.2
            Reporter: James Sullivan
            Priority: Minor

Exception handling is still needed around hostStore.flush() for those who have to use gora-core 0.2.1; otherwise Nutch 2.x will not compile.

<!-- Uncomment this to use SQL as Gora backend. It should be noted that the
     gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3.
     Users should downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->

Index: src/java/org/apache/nutch/host/HostDb.java
===================================================================
--- java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (revision 1487824)
+++ java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (working copy)
@@ -87,7 +87,11 @@
       CacheHost removeFromCacheHost = notification.getValue();
       if (removeFromCacheHost != NULL_HOST) {
         if (removeFromCacheHost.timestamp < lastFlush.get()) {
-          hostStore.flush();
+          try {
+            hostStore.flush();
+          } catch (IOException e) {
+            throw new RuntimeException(e);
+          }
           lastFlush.set(System.currentTimeMillis());
         }
       }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
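For readers unfamiliar with why the try/catch matters, the pattern in the patch can be sketched in isolation. Everything below is hypothetical and is not Nutch or Gora code: FlushableStore stands in for a store whose flush() declares IOException (as the report implies for gora-core 0.2.1), and a plain Runnable stands in for the cache callback, whose signature cannot declare checked exceptions.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration of the wrap-and-rethrow pattern from the patch. */
class FlushWrapSketch {

    /** Stand-in for a store whose flush() declares a checked exception. */
    interface FlushableStore {
        void flush() throws IOException;
    }

    /**
     * Returns a callback that flushes the store and records the flush time.
     * Runnable.run() cannot throw IOException, so without the try/catch the
     * call to store.flush() would not compile.
     */
    static Runnable flushingCallback(FlushableStore store, AtomicLong lastFlush) {
        return () -> {
            try {
                store.flush();
            } catch (IOException e) {
                // Rethrow unchecked, keeping the original exception as the cause.
                throw new RuntimeException(e);
            }
            // Only reached when the flush succeeded.
            lastFlush.set(System.currentTimeMillis());
        };
    }

    public static void main(String[] args) {
        AtomicLong lastFlush = new AtomicLong(0L);
        Runnable callback = flushingCallback(() -> { /* pretend to flush */ }, lastFlush);
        callback.run();
        System.out.println("last flush recorded: " + (lastFlush.get() > 0));
    }
}
```

If the store's flush() does not declare IOException, the catch clause is what fails to compile instead, because Java rejects a catch of a checked exception that the try body cannot throw. That is presumably why the block was dropped for the newer gora-core.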
[jira] [Updated] (NUTCH-1576) Need to keep hotStore.flush() exception catching
[ https://issues.apache.org/jira/browse/NUTCH-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1576:
----------------------------------
    Attachment: patch.txt

> Need to keep hotStore.flush() exception catching
> -------------------------------------------------
>
>                 Key: NUTCH-1576
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1576
>             Project: Nutch
>          Issue Type: Bug
>   Affects Versions: 2.2
>            Reporter: James Sullivan
>            Priority: Minor
>         Attachments: patch.txt
>
> Exception handling is still needed around hostStore.flush() for those who have to use gora-core 0.2.1; otherwise Nutch 2.x will not compile.