[ https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605091#action_12605091 ]

Andrzej Bialecki  commented on NUTCH-635:
-----------------------------------------

This patch looks great! A few comments:

* in OutlinksDb.reduce() you use a simple assignment mostRecent = next. This 
doesn't work as expected, because the Hadoop iterator reuses the same single 
instance of Outlinks under the hood, so if you keep a reference to it, its 
value will mysteriously change under your feet as you call values.next(). This 
should be replaced with a deep copy (or clone) of the instance, either through 
a dedicated method of Outlinks or WritableUtils.copy().
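The aliasing problem can be sketched in plain Java (illustrative names, not the actual Nutch classes; Hadoop's WritableUtils.clone() plays the role of copy() here):

```java
import java.util.Iterator;

// Sketch of Hadoop's reduce-side object reuse: the iterator hands back the
// SAME mutable instance on every next() call, so a plain reference
// assignment later sees whatever value was written last. A deep copy keeps
// a stable snapshot.
public class IteratorReuseDemo {
    // Stand-in for a mutable Writable such as Outlinks.
    static class Record {
        long timestamp;
        Record copy() {                 // the "deep copy" the review asks for
            Record r = new Record();
            r.timestamp = this.timestamp;
            return r;
        }
    }

    // Mimics Hadoop's behavior: one shared instance, refilled per next().
    static Iterator<Record> reusingIterator(final long[] values) {
        final Record shared = new Record();
        final int[] i = {0};
        return new Iterator<Record>() {
            public boolean hasNext() { return i[0] < values.length; }
            public Record next() { shared.timestamp = values[i[0]++]; return shared; }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    // Returns {timestamp seen via the aliased reference, timestamp via copy}.
    static long[] run(long[] values) {
        Iterator<Record> it = reusingIterator(values);
        Record broken = null, fixed = null;
        while (it.hasNext()) {
            Record next = it.next();
            if (broken == null || next.timestamp > broken.timestamp) {
                broken = next;          // BUG: aliases the shared instance
                fixed = next.copy();    // FIX: snapshot the current value
            }
        }
        return new long[]{broken.timestamp, fixed.timestamp};
    }

    public static void main(String[] args) {
        long[] r = run(new long[]{30, 10, 20});
        // The aliased reference silently took the last value the iterator
        // wrote (20), while the copy still holds the real maximum (30).
        System.out.println(r[0] + " " + r[1]);
    }
}
```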

* you should avoid spurious whitespace changes to existing classes; they make 
the patch harder to read (e.g. Outlink.java)

* in Outlinks.write() I think there's a bug: you write out 
System.currentTimeMillis() instead of this.timestamp. Is this intentional?
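A minimal sketch of why this matters (class and field names are assumptions, not the actual Outlinks source): writing the current time in write() means the stored timestamp changes on every serialization, so a record no longer round-trips to the value it was created with.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TimestampWriteDemo {
    long timestamp;

    TimestampWriteDemo(long timestamp) { this.timestamp = timestamp; }

    // Suspected bug would be: out.writeLong(System.currentTimeMillis());
    // The fix is to persist the field itself:
    void write(DataOutputStream out) throws IOException {
        out.writeLong(this.timestamp);
    }

    static TimestampWriteDemo read(DataInputStream in) throws IOException {
        return new TimestampWriteDemo(in.readLong());
    }

    // Serialize and deserialize one record, returning the recovered timestamp.
    static long roundTrip(long ts) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new TimestampWriteDemo(ts).write(new DataOutputStream(buf));
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        return read(in).timestamp;
    }

    public static void main(String[] args) throws IOException {
        // With the fix, the stored value survives the round trip.
        System.out.println(roundTrip(1234567890L));
    }
}
```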

* in LinkAnalysis.Counter.map(), since you output static values, you should 
avoid creating new instances and reuse a pair of static instances instead.
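The reuse pattern looks roughly like this (illustrative names, not the patch's code): allocate the constant output value once and hand the same instance to the collector on every map() call, instead of constructing a fresh object per record.

```java
// Plain-Java analogue of reusing Writable output instances in a mapper.
public class ReusedOutputDemo {
    // Stand-in for a Hadoop Writable such as LongWritable.
    static class LongBox {
        long value;
        LongBox(long v) { value = v; }
    }

    // Allocated once, reused for every emitted record -- the equivalent of
    // a static LongWritable ONE = new LongWritable(1) in real Hadoop code.
    private static final LongBox ONE = new LongBox(1L);

    // Simulates emitting (key, ONE) for each input record; no per-record
    // allocation happens inside the loop.
    static long countWithReuse(int records) {
        long emitted = 0;
        for (int i = 0; i < records; i++) {
            emitted += ONE.value;   // collector.collect(key, ONE) in Hadoop
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(countWithReuse(1000));
    }
}
```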

* by the way, in an implementation of a similar algorithm I used Hadoop 
Counters to count the totals; this way you avoid storing magic numbers in the 
db itself (although you still need to preserve them somewhere, so I'd create 
an additional file with this value ... well, perhaps not so elegant either 
after all ;) ).
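A rough plain-Java analogue of the Counters idea (the counter names and job shape are assumptions, not Nutch code): accumulate totals in named counters while the job runs, then persist the final values to a side file, rather than mixing magic-number records into the db.

```java
import java.util.EnumMap;

public class TotalsCounterDemo {
    // Named counters, analogous to a Hadoop Counters enum.
    enum Counter { TOTAL_NODES }

    private final EnumMap<Counter, Long> counters =
        new EnumMap<Counter, Long>(Counter.class);

    // Analogous to reporter.incrCounter(c, by) in the old Hadoop API.
    void increment(Counter c, long by) {
        Long cur = counters.get(c);
        counters.put(c, (cur == null ? 0L : cur) + by);
    }

    long get(Counter c) {
        Long cur = counters.get(c);
        return cur == null ? 0L : cur;
    }

    public static void main(String[] args) {
        TotalsCounterDemo job = new TotalsCounterDemo();
        for (int i = 0; i < 5; i++) {
            job.increment(Counter.TOTAL_NODES, 1);  // one per page seen
        }
        // The total lives in the counter, not in a magic db record; it would
        // then be written to a small side file for later iterations to read.
        System.out.println(job.get(Counter.TOTAL_NODES));
    }
}
```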

* LinkAnalysis.Analyzer.reduce() - you should retrieve config parameters in 
configure(Job); otherwise you pay the price of getting floats from the 
Configuration on every call (which involves repeated creation of Float objects 
via Float.parseFloat()). Also, the HashPartitioner should be created once. 
This is a general comment on the patch - it creates a lot of objects 
unnecessarily. We can optimize it now or later, whichever you prefer.
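The caching pattern can be sketched like this (the class shape and property name are illustrative; in Hadoop the parsing would happen in configure(JobConf)): parse the float once when the task is configured and keep it in a field, rather than re-parsing a string on every reduce() call.

```java
import java.util.Properties;

public class ConfiguredReducerDemo {
    private float dampingFactor;   // cached once, read on every reduce() call

    // Analogous to configure(JobConf): Float.parseFloat runs once here
    // instead of once per record. "link.analyze.damping" is a made-up key.
    void configure(Properties conf) {
        dampingFactor =
            Float.parseFloat(conf.getProperty("link.analyze.damping", "0.85"));
    }

    // Uses the cached field; no per-call string lookup, parsing, or boxing.
    float reduce(float score) {
        return dampingFactor * score;
    }

    public static void main(String[] args) {
        ConfiguredReducerDemo r = new ConfiguredReducerDemo();
        r.configure(new Properties());      // falls back to the 0.85 default
        System.out.println(r.reduce(1.0f));
    }
}
```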

I haven't yet gone into the algorithm itself deeply enough to give any useful 
comments ... But I have a dataset of ~4 million pages I can test it on.



> LinkAnalysis Tool for Nutch
> ---------------------------
>
>                 Key: NUTCH-635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-635
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch
>
>
> This is a basic pagerank type link analysis tool for nutch which simulates a 
> sparse matrix using inlinks and outlinks and converges after a given number 
> of iterations.  This tool is meant to replace the current scoring system in 
> nutch with a system that converges instead of exponentially increasing 
> scores.  Also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
