[ 
https://issues.apache.org/jira/browse/NUTCH-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Aragon updated NUTCH-2742:
-------------------------------
     Attachment: crawl.log
                 segment-dump.txt
    Description: 
It appears that the Tika plugin is not parsing some PDF files.

An example is
"https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf".

When I dumped the segment data, there was no content.

 

EDIT: See the attached files for the segment dump output and the crawl log.

 

  was:
It appears that the Tika plugin is not parsing some PDF files.

An example is
"https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf".

When I dumped the segment data, there was no content.

 

Output of dumped segment data:

 

Recno:: 0
URL:: [https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf]

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Oct 07 00:00:37 AEDT 2019
Modified time: Thu Jan 01 10:00:00 AEST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1570366841510

 


> Unable to parse specific pdf file
> ---------------------------------
>
>                 Key: NUTCH-2742
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2742
>             Project: Nutch
>          Issue Type: Bug
>          Components: nutchNewbie, parser
>    Affects Versions: 1.15
>            Reporter: Mark Aragon
>            Priority: Minor
>         Attachments: crawl.log, segment-dump.txt
>
>
> It appears that the Tika plugin is not parsing some PDF files.
> An example is
> "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf".
> When I dumped the segment data, there was no content.
>  
> EDIT: See attached for output and crawl log
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
