Yossi Tamari commented on NUTCH-2742:

[~Mark A] The important line in crawl.log is:
This means that the URL was not fetched because the robots.txt file of the 
domain forbids it.
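
To check this kind of rule outside Nutch, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the agent name `my-nutch-crawler` are hypothetical and only for illustration; the rules the site actually serves may differ, and Nutch's own robots.txt handling may not match this parser in every detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
# The file actually served by the site may look different.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /parlInfo/download/
"""

# URL taken from the ticket description.
PDF_URL = (
    "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/"
    "1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/"
    "Senate_2019_02_12_6907_Official.pdf"
)


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "my-nutch-crawler" is a made-up agent name, not a Nutch default.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-nutch-crawler", PDF_URL))
```

With the sample rules above, any path under `/parlInfo/download/` is reported as disallowed, which is the situation a crawler would skip rather than fetch.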

This is a feature, not a bug, so I suggest you close the ticket.

If you have questions regarding Nutch, I suggest asking them on the 
[u...@nutch.apache.org|mailto:u...@nutch.apache.org] mailing list, see 

> Unable to parse specific pdf file
> ---------------------------------
>                 Key: NUTCH-2742
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2742
>             Project: Nutch
>          Issue Type: Bug
>          Components: nutchNewbie, parser
>    Affects Versions: 1.15
>            Reporter: Mark Aragon
>            Priority: Minor
>         Attachments: crawl.log, segment-dump.txt
> It appears that the Tika plugin is not parsing some PDF files.
> An example is
> https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf
> When I completed a dump of the segment data, there was no content.
> EDIT: See attached for output and crawl log.

