The organisation I work for currently uses poppler's pdfunite utility as part of our preservation system. We scan documents, run them through ABBYY Recognition Server to generate a PDF for each page, and use pdfunite to join those files into multi-page PDFs, which are made available for download.
We recently started to investigate adopting JHOVE http://jhove.openpreservation.org/ which identifies and validates files, including PDF files. JHOVE is indicating there are problems with the files we create with pdfunite, as well as the files we previously created with `pdftk cat`.

The thread in the JHOVE forum can be seen at http://lists.openpreservation.org/pipermail/jhove/2017-April/thread.html#3

A pdfunite-generated file is available via http://pub.canadiana.ca/view/omcn.MississaugaNews_2 (download link beside the zoom buttons).

With `pdftk cat` the problem happens after a certain size (around 795 pages from the sample pages I used).

Using only two of those single-page PDF files as an example, I get the following with the latest release of pdfunite (compiled on Ubuntu 14.04):

cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0001.pdf | grep '<status'
  <status>Well-Formed and valid</status>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2/0002.pdf | grep '<status'
  <status>Well-Formed and valid</status>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
    at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
    at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
    at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove" release="1.16.5" date="2017-03-20">
 <date>2017-04-07T12:22:35-04:00</date>
 <repInfo uri="pdfunite.pdf">
  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2017-04-07T12:22:16-04:00</lastModified>
  <size>2888705</size>
  <format>PDF</format>
  <status>Not well-formed</status>
  <sigMatch>
   <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message offset="2888253" severity="error">46</message>
   <message offset="0" severity="error">No document catalog dictionary</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/local/bin/pdfunite -v
pdfunite version 0.53.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm@russell-desktop:/opt/wip/Temp/rwm$

Same with the older version of Poppler that is distributed with Ubuntu 14.04:

cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite MississaugaNews_2/0001.pdf MississaugaNews_2/0002.pdf pdfunite.pdf
cihm@russell-desktop:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 60
    at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
    at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
    at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd" name="Jhove" release="1.16.5" date="2017-03-20">
 <date>2017-04-07T12:25:50-04:00</date>
 <repInfo uri="pdfunite.pdf">
  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2017-04-07T12:25:42-04:00</lastModified>
  <size>2885504</size>
  <format>PDF</format>
  <status>Not well-formed</status>
  <sigMatch>
   <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message offset="2885066" severity="error">46</message>
   <message offset="0" severity="error">No document catalog dictionary</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>
cihm@russell-desktop:/opt/wip/Temp/rwm$ /usr/bin/pdfunite -v
pdfunite version 0.24.5
Copyright 2005-2013 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
cihm@russell-desktop:/opt/wip/Temp/rwm$

Any suggestions? I can make the source PDF files available if that would help.
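
For anyone who wants to repeat the check over more than two pages, the merge-and-validate steps shown in the sessions could be scripted roughly as follows. This is only a sketch, not something we run in production: the function names `parse_jhove_report` and `unite_and_validate` are made up for illustration, the `pdfunite` and `jhove` executables are assumed to be on the PATH (ours live in /usr/local/bin and /opt/jhove), and I am assuming the XML report arrives on standard output. Any text printed before the XML declaration, such as the Java stack trace, is skipped.

```python
import subprocess
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the JHOVE report.
JHOVE_NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}


def parse_jhove_report(text):
    """Return (status, [error messages]) from JHOVE XML handler output."""
    # Skip anything printed before the XML declaration (e.g. a stack trace).
    start = text.find("<?xml")
    if start == -1:
        raise ValueError("no XML report found in JHOVE output")
    root = ET.fromstring(text[start:].encode("utf-8"))
    rep_info = root.find("j:repInfo", JHOVE_NS)
    if rep_info is None:
        raise ValueError("JHOVE report has no repInfo element")
    status = rep_info.findtext("j:status", namespaces=JHOVE_NS)
    messages = [m.text for m in rep_info.findall("j:messages/j:message", JHOVE_NS)]
    return status, messages


def unite_and_validate(pages, output, pdfunite="pdfunite", jhove="jhove"):
    """Join single-page PDFs with pdfunite, then validate the result with JHOVE.

    The executable names are assumptions; pass full paths if needed.
    """
    subprocess.run([pdfunite, *pages, output], check=True)
    # Same invocation as in the sessions: PDF-hul module, XML output handler.
    result = subprocess.run(
        [jhove, "-m", "pdf-hul", "-h", "xml", output],
        capture_output=True,
        text=True,
    )
    return parse_jhove_report(result.stdout)
```

Calling `unite_and_validate(["MississaugaNews_2/0001.pdf", "MississaugaNews_2/0002.pdf"], "pdfunite.pdf")` would return the status string and the list of messages from the report, which makes it easier to find the page count at which a joined file stops validating.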

cihm@russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0001.pdf
Producer:       ABBYY Recognition Server
CreationDate:   Sun Mar 12 09:44:20 2017 EDT
ModDate:        Sun Mar 12 09:44:20 2017 EDT
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      733.45 x 1486.1 pts
Page rot:       0
File size:      2388234 bytes
Optimized:      no
PDF version:    1.4
cihm@russell-desktop:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2/0002.pdf
Producer:       ABBYY Recognition Server
CreationDate:   Sun Mar 12 09:43:59 2017 EDT
ModDate:        Sun Mar 12 09:43:59 2017 EDT
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      783.35 x 1428.5 pts
Page rot:       0
File size:      511205 bytes
Optimized:      no
PDF version:    1.4
cihm@russell-desktop:/opt/wip/Temp/rwm$

-- 
System Administration and software developer, Canadiana.org
http://www.canadiana.ca
