[jira] [Commented] (NUTCH-2806) Nutch can't parse links

Sebastian Nagel (Jira) Mon, 27 Jul 2020 02:11:12 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165557#comment-17165557
 ]


Sebastian Nagel commented on NUTCH-2806:
----------------------------------------

Hi [~immobilier-dz], this could also be caused by http.content.limit, which by 
default in 2.4 fetches only the first 64 kiB of the page. If you increase the 
limit, more links are found. You can test it by running
{noformat}
$NUTCH_HOME/bin/nutch parsechecker -Dhttp.content.limit=-1 https://www.algeriahome.com/
{noformat}
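
If the parsechecker run confirms this, the limit can also be raised for regular 
crawls by overriding the property in conf/nutch-site.xml. A minimal sketch, 
assuming the standard configuration layout; a negative value disables truncation 
entirely, so a larger positive byte count may be the safer choice for big crawls:
{noformat}
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- negative value = no truncation (assumption: acceptable for this site) -->
  <value>-1</value>
</property>
{noformat}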

> Nutch can't parse links 
> ------------------------
>
>                 Key: NUTCH-2806
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2806
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4
>            Reporter: lina dziri
>            Priority: Major
>
> Testing with the following site: 
> [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
> links that contain the base URL. 
>  I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and 
> tried practically every suggestion about detecting all the links. I suspected 
> urlfilter or regex-normalizer, so I disabled them, but got the same results. 
>  Each time I rebuild Nutch and test the parser, it gives the same URL count, 
> around 378. 
>  Can somebody help fix this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
