lina dziri created NUTCH-2806:
---------------------------------

             Summary: Nutch can't parse links 
                 Key: NUTCH-2806
                 URL: https://issues.apache.org/jira/browse/NUTCH-2806
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 2.4
            Reporter: lina dziri
             Fix For: 2.4


Testing with the following site: 
[https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
links that contain the base URL. 
 I tried Tika as the parser, set db.max.outlinks.per.page to -1, and tried 
practically every suggestion I could find about detecting all the links. I 
suspected urlfilter or regex-normalizer, so I disabled them, but the results 
are the same. 
 Each time I rebuild Nutch and test the parser, it gives the same URL count, 
around 378. 
 Can somebody help fix this?
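
 For reference, the outlink limit mentioned above would be overridden roughly 
as follows. This is a minimal sketch that assumes the override lives in 
conf/nutch-site.xml; the actual configuration file used in the test is not 
shown in this report.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location): local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- A negative value removes the per-page outlink cap (the default is 100). -->
    <value>-1</value>
  </property>
</configuration>
```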



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
