[ 
https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1431.
-------------------------------

    Resolution: Fixed

committed
                
> Introduce link 'distance' and add configurable max distance in the generator
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1431
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1431
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Ferdy Galema
>             Fix For: 2.1
>
>         Attachments: NUTCH-1431.patch
>
>
> This introduces a new feature that makes it possible to crawl URLs within a 
> specific distance (shortest path) from the injected source URLs. This is 
> where the db-updater of Nutchgora really shines: because every URL in the 
> reducer has all of its inlinks present, it is easy to determine the shortest 
> path to that URL. (I would not know how to cleanly implement this feature 
> for trunk.)
> Injected URLs have distance 0. Outlinks on those pages have distance 1. 
> Outlinks on those pages have distance 2, etc. Outlinks that already had a 
> smaller distance keep that distance. Of all inlinks to a page, the smallest 
> distance is always selected in order to maintain the shortest-path 
> guarantee.
> The generator now has a property 'generate.max.distance' (default -1) that 
> specifies the maximum allowed distance of URLs to select for fetch.
> Note that this is fundamentally different from the concept of crawl 'depth'. 
> Depth is used for crawl cycles. Distance allows crawling for an unlimited 
> number of cycles AND always staying within a certain number of 'hops' from 
> the injected URLs.
> I will attach a patch. Will commit in a few days. (It does not change crawl 
> behaviour unless otherwise configured). Let me know if you have comments.
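
The distance rule and the generator limit described above can be sketched
roughly as follows. This is an illustrative sketch only, not the code from
NUTCH-1431.patch: the class name, the method names and the UNKNOWN marker are
hypothetical. It also assumes that a negative 'generate.max.distance' (such as
the default -1) means "no limit", and that a URL with no known distance is
skipped once a limit is set; neither detail is confirmed by the description.

```java
import java.util.Arrays;
import java.util.List;

public class DistanceSketch {
    /** Hypothetical marker for "no distance known yet" (not taken from the patch). */
    static final int UNKNOWN = -1;

    /**
     * Returns the distance a page should keep after an update, given its
     * currently stored distance and the distances of all pages linking to it.
     * Each inlink offers (inlink distance + 1); the smallest value wins, and
     * an already smaller stored distance is never overwritten.
     */
    static int shortestDistance(int current, List<Integer> inlinkDistances) {
        int best = current;
        for (int d : inlinkDistances) {
            if (d == UNKNOWN) {
                continue; // the linking page has no distance yet
            }
            int candidate = d + 1;
            if (best == UNKNOWN || candidate < best) {
                best = candidate;
            }
        }
        return best;
    }

    /**
     * Generator-side check. Assumption: a negative maxDistance disables the
     * limit, and pages without a known distance are skipped when a limit is set.
     */
    static boolean withinMaxDistance(int distance, int maxDistance) {
        if (maxDistance < 0) {
            return true;
        }
        return distance != UNKNOWN && distance <= maxDistance;
    }

    public static void main(String[] args) {
        // An injected URL (distance 0) keeps 0 even if a distant page links to it.
        System.out.println(shortestDistance(0, Arrays.asList(3)));
        // A new page linked from pages at distance 0 and 2 gets distance 1.
        System.out.println(shortestDistance(UNKNOWN, Arrays.asList(0, 2)));
        // With a limit of 2, a page at distance 3 is not selected for fetch.
        System.out.println(withinMaxDistance(3, 2));
        // With the default of -1, nothing is filtered out.
        System.out.println(withinMaxDistance(3, -1));
    }
}
```

Because the minimum is taken over all inlinks every time, a page's distance
can only shrink across crawl cycles, which is what keeps it a shortest path.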

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
