hakim created NUTCH-2377:
----------------------------
Summary: Nutch can't parse relative links
Key: NUTCH-2377
URL: https://issues.apache.org/jira/browse/NUTCH-2377
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.3
Environment: centos 7, hbase 0.98
Reporter: hakim
Priority: Critical
Testing with the following site: https://www.ouedkniss.com, Nutch only parses
links that contain the base URL.
I tried Tika as the parser and tried setting db.max.outlinks.per.page to -1. I
tried practically every suggestion I could find about detecting all the links.
I suspected urlfilter or regex-normalizer, so I disabled them, but got the
same result.
Each time I rebuild Nutch and test the parser, it gives the same URL count,
around 378.
Can somebody help fix this?
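
To illustrate the expected behaviour, here is a minimal sketch in Python (not
Nutch code) of relative links being resolved against the page's base URL. The
hrefs and the resolve_outlinks helper are hypothetical, made up for
illustration; they are not taken from the site or from Nutch.

```python
from urllib.parse import urljoin

# Base URL of the page being parsed (the site named in this report).
base = "https://www.ouedkniss.com/"

# Hypothetical hrefs for illustration only; not taken from the actual page.
hrefs = ["/annonces/", "contact.html", "https://www.ouedkniss.com/aide"]


def resolve_outlinks(base_url, links):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in links]


outlinks = resolve_outlinks(base, hrefs)
for link in outlinks:
    print(link)
```

With this kind of resolution, the two relative hrefs would be counted as
outlinks alongside the absolute one, which is what I expected from the parser.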
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)