[
https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1184:
---------------------------------
Attachment: NUTCH-1185-1.5-6.patch
New patch includes all involved files:
* ParseData
* ParseOutputFormat
* Fetcher
* nutch-default
It also adds a divisor to control the number of outlinks selected by depth,
and includes two new reporters for outlinks (detected and followed) plus a
reporter for the number of downloaded bytes.
> Fetcher to parse and follow Nth degree outlinks
> -----------------------------------------------
>
> Key: NUTCH-1184
> URL: https://issues.apache.org/jira/browse/NUTCH-1184
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch,
> NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch,
> NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch,
> NUTCH-1185-1.5-6.patch
>
>
> Fetcher improvements to parse and follow outlinks up to a specified depth.
> The number of outlinks to follow can be decreased by depth using a divisor.
> This patch introduces three new configuration directives:
> {code}
> <property>
> <name>fetcher.follow.outlinks.depth</name>
> <value>-1</value>
> <description>(EXPERT)When fetcher.parse is true and this value is greater
> than 0 the fetcher will extract outlinks
> and follow until the desired depth is reached. A value of 1 means all
> generated pages are fetched and their first degree
> outlinks are fetched and parsed too. Be careful, this feature is in itself
> agnostic of the state of the CrawlDB and does not
> know about already fetched pages. A setting larger than 2 will most likely
> fetch home pages twice in the same fetch cycle.
> A value of -1 or 0 disables this feature.
> </description>
> </property>
> <property>
> <name>fetcher.follow.outlinks.num.links</name>
> <value>4</value>
> <description>(EXPERT)The number of outlinks to follow when
> fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
> the total number of pages to fetch. This works with
> fetcher.follow.outlinks.depth.divisor: with the default settings the number
> of followed outlinks at depth 1 is 8, not 4.
> </description>
> </property>
> <property>
> <name>fetcher.follow.outlinks.depth.divisor</name>
> <value>2</value>
> <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links per
> fetcher.follow.outlinks.depth. This decreases the number
> of outlinks to follow by increasing depth. The formula used is: outlinks =
> floor(divisor / depth * num.links). This prevents
> exponential growth of the fetch list.
> </description>
> </property>
> {code}
> Please, do not use this unless you know what you're doing. This feature does
> not consider the state of the CrawlDB nor does it consider generator settings
> such as limiting the number of pages per (domain|host|ip) queue. It is not
> polite to use this feature with high settings as it can fetch many pages from
> the same domain including duplicates.
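
To make the divisor formula from the description concrete, here is a minimal
sketch in Java. It is not taken from the patch: the class name OutlinkBudget
and the method outlinksAtDepth are hypothetical, and it assumes the formula
floor(divisor / depth * num.links) is applied exactly as written, with depth
starting at 1. For positive integers that expression equals integer division
of (divisor * num.links) by depth, which is what the sketch uses.

```java
public class OutlinkBudget {

    // Hypothetical helper, not part of the patch: number of outlinks to
    // follow at a given depth, i.e. floor(divisor / depth * numLinks).
    // For positive ints this equals (divisor * numLinks) / depth with
    // integer division, which avoids floating-point rounding.
    static int outlinksAtDepth(int numLinks, int divisor, int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        return (divisor * numLinks) / depth;
    }

    public static void main(String[] args) {
        int numLinks = 4; // default of fetcher.follow.outlinks.num.links
        int divisor = 2;  // default of fetcher.follow.outlinks.depth.divisor
        for (int depth = 1; depth <= 4; depth++) {
            System.out.println("depth " + depth + ": "
                + outlinksAtDepth(numLinks, divisor, depth) + " outlinks");
        }
    }
}
```

With the default values this gives 8 outlinks at depth 1, 4 at depth 2, and 2
at depths 3 and 4, which matches the note that the default at depth 1 is 8
rather than 4.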