This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.
    from cf183ad  Merge pull request #358 from sebastian-nagel/NUTCH-2071
     add 579a76b  NUTCH-1106 Options to skip url's based on length
                  - add property db.max.outlink.length to limit length of
                    outlinks and redirects (default = 8192 characters)
                  - add rule (not active) to regex-urlfilter.txt.template
     add 8d434b5  NUTCH-1106 Options to skip url's based on length
                  - most browsers support URLs up to around 2048 characters
                  - use this value for the rule in regex-urlfilter.txt
                  - limit outlink length to 4096 characters to allow
                    additional characters removed during normalization
                    (anchor, query args)
     new f263d91  Merge pull request #359 from
                  sebastian-nagel/NUTCH-1106-max-outlink-length
The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
 conf/nutch-default.xml                                 | 15 +++++++++++++++
 conf/regex-urlfilter.txt.template                      |  3 +++
 src/java/org/apache/nutch/fetcher/FetcherThread.java   | 12 ++++++++++--
 src/java/org/apache/nutch/parse/ParseOutputFormat.java |  8 +++++++-
 4 files changed, 35 insertions(+), 3 deletions(-)
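
The commit messages above describe two length checks: a hard limit on outlinks and redirects (db.max.outlink.length, 4096 characters after the second commit) and an optional URL filter rule for URLs longer than 2048 characters. The following is a minimal sketch of that idea, not the code from FetcherThread.java or ParseOutputFormat.java. The class name, the method names, the regular expression, and the convention that a negative limit disables the check are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class OutlinkLengthFilter {

    // Hypothetical pattern for "URL longer than 2048 characters";
    // the actual rule added to regex-urlfilter.txt.template may differ.
    private static final Pattern OVERLONG = Pattern.compile("^.{2049,}");

    /**
     * Keeps only URLs whose length does not exceed maxLength.
     * Assumption of this sketch: a negative maxLength means "no limit".
     */
    public static List<String> filterByLength(List<String> urls, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (maxLength >= 0 && url.length() > maxLength) {
                continue; // skip overlong outlink or redirect target
            }
            kept.add(url);
        }
        return kept;
    }

    /** Returns true if the URL would be matched by the (hypothetical) overlong rule. */
    public static boolean matchesOverlongRule(String url) {
        return OVERLONG.matcher(url).find();
    }

    /** Builds a test URL of exactly n characters (n must be at least 19). */
    public static String urlOfLength(int n) {
        StringBuilder sb = new StringBuilder("http://example.org/");
        while (sb.length() < n) {
            sb.append('a');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(urlOfLength(4096), urlOfLength(4097));
        // Only the URL within the 4096-character limit is kept.
        System.out.println(filterByLength(outlinks, 4096).size());
        // A 2049-character URL matches the overlong rule; a 2048-character one does not.
        System.out.println(matchesOverlongRule(urlOfLength(2049)));
        System.out.println(matchesOverlongRule(urlOfLength(2048)));
    }
}
```

The larger limit for outlinks leaves headroom for characters that normalization removes later (anchor, query args), as the second commit message notes, so the 2048-character filter rule can then apply to the normalized URL.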