Re: [jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

Markus Jelsma Mon, 19 Dec 2011 09:17:45 -0800
Thanks. Will commit tomorrow.

On Monday 19 December 2011 18:01:31 Julien Nioche (Commented) (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172397#comment-13172397 ]
> 
> Julien Nioche commented on NUTCH-1184:
> --------------------------------------
> 
> Just managed to have a look and haven't seen any reason not to commit
> (disclaimer: I haven't compiled or tested the code). Thanks
> 
> Julien
> 
> > Fetcher to parse and follow Nth degree outlinks
> > -----------------------------------------------
> > 
> >                 Key: NUTCH-1184
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-1184
> >             
> >             Project: Nutch
> >          
> >          Issue Type: New Feature
> >          Components: fetcher
> >          
> >            Reporter: Markus Jelsma
> >            Assignee: Markus Jelsma
> >            
> >             Fix For: 1.5
> >         
> >         Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch,
> >         NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch,
> >         NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch,
> >         NUTCH-1184-1.5-9-ParseOutputFormat.patch,
> >         NUTCH-1185-1.5-6.patch, NUTCH-1185-1.5-7.patch,
> >         NUTCH-1185-1.5-8.patch, NUTCH-1185-1.5-9.patch
> > 
> > Fetcher improvements to parse and follow outlinks up to a specified
> > depth. The number of outlinks to follow can be decreased by depth using
> > a divisor. This patch introduces three new configuration directives:
> > {code}
> > <property>
> > 
> >   <name>fetcher.follow.outlinks.depth</name>
> >   <value>-1</value>
> >   <description>(EXPERT)When fetcher.parse is true and this value is
> >   greater than 0 the fetcher will extract outlinks and follow until the
> >   desired depth is reached. A value of 1 means all generated pages are
> >   fetched and their first degree outlinks are fetched and parsed too. Be
> >   careful, this feature is in itself agnostic of the state of the
> >   CrawlDB and does not know about already fetched pages. A setting
> >   larger than 2 will most likely fetch home pages twice in the same
> >   fetch cycle. It is highly recommended to set db.ignore.external.links
> >   to true to restrict the outlink follower to URLs within the same
> >   domain. When that setting is disabled (false) the feature is likely to
> >   follow duplicates even when depth=1. A value of -1 or 0 disables this
> >   feature.
> >   </description>
> > 
> > </property>
> > <property>
> > 
> >   <name>fetcher.follow.outlinks.num.links</name>
> >   <value>4</value>
> >   <description>(EXPERT)The number of outlinks to follow when
> >   fetcher.follow.outlinks.depth is enabled. Be careful, this can
> >   multiply the total number of pages to fetch. This works together with
> >   fetcher.follow.outlinks.depth.divisor: with the default settings the
> >   number of outlinks followed at depth 1 is 8, not 4.
> >   </description>
> > 
> > </property>
> > <property>
> > 
> >   <name>fetcher.follow.outlinks.depth.divisor</name>
> >   <value>2</value>
> >   <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links
> >   per fetcher.follow.outlinks.depth. This decreases the number of
> >   outlinks to follow as depth increases. The formula used is: outlinks
> >   = floor(divisor / depth * num.links). This prevents exponential growth
> >   of the fetch list.
> >   </description>
> > 
> > </property>
> > {code}
> > Please, do not use this unless you know what you're doing. This feature
> > does not consider the state of the CrawlDB nor does it consider
> > generator settings such as limiting the number of pages per
> > (domain|host|ip) queue. It is not polite to use this feature with high
> > settings as it can fetch many pages from the same domain including
> > duplicates.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
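
To make the divisor formula in the quoted description concrete, here is a minimal sketch in Python. It is not taken from the patch. The function names are hypothetical, and the worst-case estimate assumes that every followed page has at least as many outlinks as the limit allows and that nothing is deduplicated, which the description suggests may happen but does not state exactly.

```python
def outlinks_to_follow(depth: int, num_links: int = 4, divisor: int = 2) -> int:
    """Outlink limit at a given depth: floor(divisor / depth * num_links).

    Defaults mirror the values quoted in the issue description.
    """
    if depth < 1:
        raise ValueError("depth must be >= 1")
    # Integer arithmetic gives the same floor for positive integers
    # and avoids floating-point rounding.
    return (divisor * num_links) // depth


def worst_case_pages(max_depth: int, num_links: int = 4, divisor: int = 2) -> int:
    """Hypothetical upper bound on extra pages fetched per generated page.

    Assumes every page yields the full outlink limit and no URL repeats.
    """
    total = 0
    pages_at_level = 1  # the generated page itself
    for depth in range(1, max_depth + 1):
        pages_at_level *= outlinks_to_follow(depth, num_links, divisor)
        total += pages_at_level
    return total


if __name__ == "__main__":
    for depth in range(1, 5):
        print(depth, outlinks_to_follow(depth))
    print(worst_case_pages(3))
```

With the default values this gives limits of 8, 4, 2 and 2 for depths 1 to 4, matching the "8, not 4" remark for depth 1. Under the stated assumptions a depth of 3 could add up to 104 pages per generated page, which is why the warning about politeness applies.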