lina dziri created NUTCH-2806:
---------------------------------
Summary: Nutch can't parse links
Key: NUTCH-2806
URL: https://issues.apache.org/jira/browse/NUTCH-2806
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.4
Reporter: lina dziri
Fix For: 2.4
When testing with the following site:
[https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses
links that contain the base URL.
I tried Tika as the parser, set db.max.outlinks.per.page to -1, and tried
practically every suggestion I found about detecting all the links. I also
suspected urlfilter and regex-normalizer, so I disabled them, but the results
were the same.
Each time I rebuild Nutch and test the parser, it gives the same URL count,
around 378.
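
To make the symptom concrete, here is a minimal sketch of the distinction I mean between links on the base host and links pointing elsewhere. This is not Nutch code; the class and function names, the sample HTML, and the paths in it are hypothetical and only illustrate the behaviour described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute href targets from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def split_outlinks(base_url, html):
    """Return (internal, external) link lists, compared by host name."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlsplit(base_url).hostname
    internal, external = [], []
    for link in collector.links:
        host = urlsplit(link).hostname
        (internal if host == base_host else external).append(link)
    return internal, external


# Made-up sample page: two links on the base host, one on another host.
SAMPLE_HTML = """
<a href="/annonces">listing</a>
<a href="https://www.algeriahome.com/contact">contact</a>
<a href="https://example.org/partner">partner</a>
"""

internal, external = split_outlinks("https://www.algeriahome.com/", SAMPLE_HTML)
print("internal:", internal)
print("external:", external)
```

In terms of this sketch, what I see from Nutch corresponds to the `internal` list only; the links that would land in `external` never show up among the parsed outlinks.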
Can somebody help fix this?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)