[ https://issues.apache.org/jira/browse/NUTCH-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Aragon updated NUTCH-2742:
-------------------------------
    Description:
It appears that the Tika plugin is not parsing some PDF files.

An example is "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf"

When I completed a dump of the segment data there is no content

Output of dumped segment data:

Recno:: 0
URL:: [https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf]

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Oct 07 00:00:37 AEDT 2019
Modified time: Thu Jan 01 10:00:00 AEST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
 _ngt_=1570366841510

  was:
It appears that the Tika plugin is not parsing some PDF files.

An example is "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf"

When I completed a dump of the segment data there is no content

```
Recno:: 0
URL:: [https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf]

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Oct 07 00:00:37 AEDT 2019
Modified time: Thu Jan 01 10:00:00 AEST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
 _ngt_=1570366841510
```

> Unable to parse specific pdf file
> ---------------------------------
>
>                 Key: NUTCH-2742
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2742
>             Project: Nutch
>          Issue Type: Bug
>          Components: nutchNewbie, parser
>    Affects Versions: 1.15
>            Reporter: Mark Aragon
>            Priority: Minor
>
> It appears that the Tika plugin is not parsing some PDF files.
> An example is
> "https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansards/1b090c4f-e4d9-4785-a733-b5270139d035/toc_pdf/Senate_2019_02_12_6907_Official.pdf"
> When I completed a dump of the segment data there is no content
>
> Output of dumped segment data:
>
> Recno:: 0
> URL::
> [https://parlinfo.aph.gov.au/parlInfo/download/chamber/hansardr/6cd30e15-83c4-4db4-bebc-e1033048fb66/toc_pdf/House%20of%20Representatives_2019_09_16_7162.pdf]
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Mon Oct 07 00:00:37 AEDT 2019
> Modified time: Thu Jan 01 10:00:00 AEST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata:
>  _ngt_=1570366841510
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)