[ https://issues.apache.org/jira/browse/NUTCH-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reopened NUTCH-1106:
------------------------------------
Assignee: Sebastian Nagel (was: Markus Jelsma)
Reopened; see the [discussion on @user|https://lists.apache.org/thread.html/0f316ce311087f6c366629b7334f4e975114622eff2550ea523fe666@%3Cuser.nutch.apache.org%3E].
The solution should include:
- filter by length in ParseOutputFormat
- controlled by a new property
- possibly also add an inactive (commented-out) rule to regex-urlfilter.txt
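
A minimal sketch of the length check listed above, not the actual patch. It assumes a hypothetical property name (example.outlink.max.length) and uses java.util.Properties as a stand-in for the Nutch configuration; the class and method names are illustrative only and do not come from ParseOutputFormat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class OutlinkLengthFilter {

    // Hypothetical property name, used for illustration only.
    static final String MAX_LENGTH_KEY = "example.outlink.max.length";

    // Reads the limit from the configuration; a negative value disables the check.
    static int maxLength(Properties conf) {
        return Integer.parseInt(conf.getProperty(MAX_LENGTH_KEY, "-1"));
    }

    // Accepts a URL if the limit is disabled or the URL is not longer than the limit.
    static boolean accept(String url, int maxLength) {
        return maxLength < 0 || url.length() <= maxLength;
    }

    // Keeps only the outlinks that pass the length check, preserving their order.
    static List<String> filter(List<String> outlinks, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (accept(url, maxLength)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAX_LENGTH_KEY, "30");

        List<String> outlinks = Arrays.asList(
            "http://example.org/",
            "http://example.org/a/very/long/path/that/exceeds/the/limit");

        // Prints the outlinks that survive the length filter.
        System.out.println(filter(outlinks, maxLength(conf)));
    }
}
```

For the regex-urlfilter.txt variant, an inactive rule could look like "# -^.{513,}$", which would reject URLs longer than 512 characters once uncommented; the value 512 is an assumption chosen for illustration.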
> Options to skip url's based on length
> -------------------------------------
>
> Key: NUTCH-1106
> URL: https://issues.apache.org/jira/browse/NUTCH-1106
> Project: Nutch
> Issue Type: Improvement
> Components: linkdb
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.5, 1.15
>
> Attachments: NUTCH-1106-1.4-1.patch
>
>
> Adds an option to skip URLs exceeding a certain length. At first we used a regex
> to impose this limit, but having this option configurable is more convenient.
> Comments?