RE: Generic LinkRank plugin for Nutch

2013-05-29 Thread Markus Jelsma
Hi Ahmet!

This is really interesting and I'm very curious about performance improvements. 
For example, it can take many hours to calculate billions of records for 10 
power iterations in 1.x! Please open a new issue in Jira whenever you're ready 
and consider copying your wiki page to the Nutch MoinMoin wiki; it is certainly 
going to be very useful!

Thanks,
Markus

 
 
-Original message-
> From:Ahmet Emre Aladağ 
> Sent: Wed 29-May-2013 10:00
> To: dev@nutch.apache.org
> Subject: Generic LinkRank plugin for Nutch
> 
> Hi,
> 
> I'm working on a LinkRank implementation with Giraph for both Nutch 1.x 
> and 2.x. What I'm planning [1] is to get the outlink data, give it to 
> Giraph as a graph and perform the LinkRank calculation, then read the 
> results and inject them back into Nutch.
> 
> Summary of the task:
> 1. Get the outlinkDB and write it as [URL, URL] pairs on HDFS.
> 2. Write current (initial) scores from CrawlDB as [URL, Score] on HDFS.
> 3. Run Giraph LinkRank.
> 4. Read the resulting [URL, NewScore] pairs.
> 5. Update CrawlDB with the new scores.
> 
> So the plugin will be like a proxy.
> 
> As far as I can see, the ScoringFilter mechanism in 1.x requires 
> implementing methods that handle URLs one by one.
> 
> Ex:
>   public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
>       Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust,
>       int allCount) throws ScoringFilterException;
> 
> 
> But I'd like to write/read the whole db. Now I think that instead of a 
> ScoringFilter, I should write a generic plugin to achieve this. Should I 
> extend Pluggable? Could you suggest the best way to achieve this? I'm 
> starting with 1.x but will move on to 2.x, so suggestions for both are 
> welcome.
> 
> Thanks,
> 
> 
> [1] 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383
> 


[jira] [Commented] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669419#comment-13669419
 ] 

Hudson commented on NUTCH-1575:
---

Integrated in Nutch-nutchgora #623 (See 
[https://builds.apache.org/job/Nutch-nutchgora/623/])
NUTCH-1575 support solr authentication in nutch 2.x (Revision 1487521)

 Result = SUCCESS
fenglu : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1487521
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrConstants.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrWriter.java


> support solr authentication in nutch 2.x
> 
>
> Key: NUTCH-1575
> URL: https://issues.apache.org/jira/browse/NUTCH-1575
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 2.1
> Reporter: lufeng
> Assignee: lufeng
> Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1575.patch
>
>
> Support Solr authentication in Nutch 2.x, as in 1.x.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669392#comment-13669392
 ] 

Lewis John McGibbney commented on NUTCH-1545:
-

Hi Feng, do you want to commit the fix for this? This is the issue that Chris 
is having on the mailing list and we really should get the fix in to 2.2.

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
> Issue Type: Task
> Affects Versions: 2.1
> Reporter: Lewis John McGibbney
> Assignee: lufeng
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
>
> The concept of a segment is replaced by batchId in 2.x.
> I'm currently getting rid of segment references in 2.x.
> This issue was flagged up and is separate from NUTCH-1532, which I am working on.



[jira] [Resolved] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1575.
---

Resolution: Fixed




[jira] [Commented] (NUTCH-1575) support solr authentication in nutch 2.x

2013-05-29 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669351#comment-13669351
 ] 

lufeng commented on NUTCH-1575:
---

Committed for 2.2 revision 1487521 by Feng. Thanks Lewis




[jira] [Resolved] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng resolved NUTCH-1563.
---

Resolution: Fixed

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 2.1
> Reporter: lufeng
> Assignee: lufeng
> Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.



[jira] [Closed] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-29 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng closed NUTCH-1563.
-





Generic LinkRank plugin for Nutch

2013-05-29 Thread Ahmet Emre Aladağ

Hi,

I'm working on a LinkRank implementation with Giraph for both Nutch 1.x 
and 2.x. What I'm planning [1] is to get the outlink data, give it to 
Giraph as a graph and perform the LinkRank calculation, then read the 
results and inject them back into Nutch.


Summary of the task:
1. Get the outlinkDB and write it as [URL, URL] pairs on HDFS.
2. Write current (initial) scores from CrawlDB as [URL, Score] on HDFS.
3. Run Giraph LinkRank.
4. Read the resulting [URL, NewScore] pairs.
5. Update CrawlDB with the new scores.

So the plugin will be like a proxy.
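
To make the data flow in steps 1-5 concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration, not Giraph or Nutch code: the class name LinkRankSketch and the method rank are hypothetical, the [URL, URL] pairs and [URL, Score] entries are plain Java collections rather than HDFS files, and a standard PageRank-style power iteration (damping factor, uniform redistribution of dangling-node score) stands in for the actual LinkRank computation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of Nutch or Giraph.
class LinkRankSketch {

    /**
     * Runs a fixed number of PageRank-style power iterations.
     * edges: [sourceUrl, targetUrl] pairs (step 1).
     * initialScores: [url, score] entries (step 2).
     * Returns the [url, newScore] map (step 4); scores sum to 1.
     */
    static Map<String, Double> rank(List<String[]> edges,
                                    Map<String, Double> initialScores,
                                    int iterations,
                                    double damping) {
        Map<String, List<String>> outlinks = new HashMap<>();
        Map<String, Double> scores = new HashMap<>(initialScores);
        for (String[] edge : edges) {
            outlinks.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
            // Make sure every URL that appears in the graph has a score entry.
            scores.putIfAbsent(edge[0], 0.0);
            scores.putIfAbsent(edge[1], 0.0);
        }
        final int n = scores.size();
        if (n == 0) {
            return scores;
        }

        // Normalize the initial scores to sum to 1 (uniform if all are zero).
        double total = 0.0;
        for (double v : scores.values()) {
            total += v;
        }
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            e.setValue(total > 0 ? e.getValue() / total : 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> received = new HashMap<>();
            double dangling = 0.0;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                List<String> targets = outlinks.get(e.getKey());
                if (targets == null) {
                    // No outlinks: this score is spread over all URLs below.
                    dangling += e.getValue();
                    continue;
                }
                double share = e.getValue() / targets.size();
                for (String target : targets) {
                    received.merge(target, share, Double::sum);
                }
            }
            // Teleport term plus the uniformly redistributed dangling score.
            double base = (1.0 - damping) / n + damping * dangling / n;
            Map<String, Double> updated = new HashMap<>();
            for (String url : scores.keySet()) {
                updated.put(url, base + damping * received.getOrDefault(url, 0.0));
            }
            scores = updated;
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"http://a/", "http://b/"});
        edges.add(new String[] {"http://a/", "http://c/"});
        edges.add(new String[] {"http://b/", "http://c/"});
        edges.add(new String[] {"http://c/", "http://a/"});
        Map<String, Double> initial = new HashMap<>();
        initial.put("http://a/", 1.0);
        initial.put("http://b/", 1.0);
        initial.put("http://c/", 1.0);
        // 10 iterations and damping 0.85 are illustrative values only.
        rank(edges, initial, 10, 0.85)
            .forEach((url, score) -> System.out.println(url + "\t" + score));
    }
}
```

In the real plugin the two inputs would be read from HDFS and the iteration would run inside Giraph; the sketch only shows what goes in and what comes back before step 5 writes the scores to CrawlDB.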

As far as I can see, the ScoringFilter mechanism in 1.x requires 
implementing methods that handle URLs one by one.


Ex:
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust,
      int allCount) throws ScoringFilterException;


But I'd like to write/read the whole db. Now I think that instead of a 
ScoringFilter, I should write a generic plugin to achieve this. Should I 
extend Pluggable? Could you suggest the best way to achieve this? I'm 
starting with 1.x but will move on to 2.x, so suggestions for both are 
welcome.


Thanks,


[1] 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383