Thanks. Will commit tomorrow.
On Monday 19 December 2011 18:01:31 Julien Nioche (Commented) (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172397#comment-13172397 ]
>
> Julien Nioche commented on NUTCH-1184:
> --------------------------------------
>
> Just managed to have a look and haven't seen any reason not to commit
> (disclaimer: I haven't compiled or tested the code). Thanks.
>
> Julien
>
> > Fetcher to parse and follow Nth degree outlinks
> > -----------------------------------------------
> >
> > Key: NUTCH-1184
> > URL: https://issues.apache.org/jira/browse/NUTCH-1184
> >
> > Project: Nutch
> >
> > Issue Type: New Feature
> > Components: fetcher
> >
> > Reporter: Markus Jelsma
> > Assignee: Markus Jelsma
> >
> > Fix For: 1.5
> >
> > Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch,
> > NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch,
> > NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch,
> > NUTCH-1184-1.5-9-ParseOutputFormat.patch,
> > NUTCH-1185-1.5-6.patch, NUTCH-1185-1.5-7.patch,
> > NUTCH-1185-1.5-8.patch, NUTCH-1185-1.5-9.patch
> >
> > Fetcher improvements to parse and follow outlinks up to a specified
> > depth. The number of outlinks to follow can be decreased by depth using
> > a divisor. This patch introduces three new configuration directives:
> > {code}
> > <property>
> >
> > <name>fetcher.follow.outlinks.depth</name>
> > <value>-1</value>
> > <description>(EXPERT)When fetcher.parse is true and this value is
> > greater than 0 the fetcher will extract outlinks and follow until the
> > desired depth is reached. A value of 1 means all generated pages are
> > fetched and their first degree outlinks are fetched and parsed too. Be
> > careful, this feature is in itself agnostic of the state of the
> > CrawlDB and does not know about already fetched pages. A setting
> > larger than 2 will most likely fetch home pages twice in the same
> > fetch cycle. It is highly recommended to set db.ignore.external.links
> > to true to restrict the outlink follower to URLs within the same
> > domain. When disabled (false) the feature is likely to follow
> > duplicates even when depth=1. A value of -1 or 0 disables this
> > feature.
> > </description>
> >
> > </property>
> > <property>
> >
> > <name>fetcher.follow.outlinks.num.links</name>
> > <value>4</value>
> > <description>(EXPERT)The number of outlinks to follow when
> > fetcher.follow.outlinks.depth is enabled. Be careful, this can
> > multiply the total number of pages to fetch. This works with
> > fetcher.follow.outlinks.depth.divisor: with the default settings,
> > the number of outlinks followed at depth 1 is 8, not 4.
> > </description>
> >
> > </property>
> > <property>
> >
> > <name>fetcher.follow.outlinks.depth.divisor</name>
> > <value>2</value>
> > <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links
> > per fetcher.follow.outlinks.depth. This decreases the number of
> > outlinks to follow by increasing depth. The formula used is: outlinks
> > = floor(divisor / depth * num.links). This prevents exponential growth
> > of the fetch list.
> > </description>
> >
> > </property>
> > {code}
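
To make the divisor formula quoted above concrete, here is a minimal sketch of the per-depth outlink count. The class and method names are hypothetical and are not taken from the patch. It also assumes non-negative integer settings and a depth of at least 1, so that floor(divisor / depth * num.links) can be computed with plain integer division.

```java
// Hypothetical helper for illustration only; not code from the NUTCH-1184 patch.
final class OutlinkBudget {

    /**
     * Number of outlinks to follow at the given depth, following the formula
     * in the description: floor(divisor / depth * numLinks).
     * Assumption: for non-negative integers this equals the integer division
     * (divisor * numLinks) / depth, which avoids floating-point rounding.
     */
    static int outlinksAtDepth(int numLinks, int divisor, int depth) {
        if (numLinks < 0 || divisor < 0 || depth < 1) {
            throw new IllegalArgumentException(
                "numLinks and divisor must be >= 0 and depth must be >= 1");
        }
        return (divisor * numLinks) / depth;
    }

    public static void main(String[] args) {
        int numLinks = 4; // default of fetcher.follow.outlinks.num.links
        int divisor = 2;  // default of fetcher.follow.outlinks.depth.divisor
        for (int depth = 1; depth <= 5; depth++) {
            System.out.println("depth " + depth + ": "
                + outlinksAtDepth(numLinks, divisor, depth) + " outlinks");
        }
    }
}
```

With the default settings this gives 8 outlinks at depth 1, 4 at depth 2 and 2 at depth 3, which matches the note in the num.links description that depth 1 follows 8 links, not 4.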
> > Please, do not use this unless you know what you're doing. This feature
> > does not consider the state of the CrawlDB nor does it consider
> > generator settings such as limiting the number of pages per
> > (domain|host|ip) queue. It is not polite to use this feature with high
> > settings as it can fetch many pages from the same domain including
> > duplicates.
>