[
https://issues.apache.org/jira/browse/PDFBOX-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler reassigned PDFBOX-5683:
------------------------------------------
Assignee: Andreas Lehmkühler
> Inconsistent/incomplete PDF rendering
> -------------------------------------
>
> Key: PDFBOX-5683
> URL: https://issues.apache.org/jira/browse/PDFBOX-5683
> Project: PDFBox
> Issue Type: New Feature
> Components: Parsing
> Reporter: Patrick Dalla Bernardina
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Labels: Forensic
> Attachments: pdf1.pdf
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> We have integrated Tika and its default parsers into a forensic tool (IPED).
> As a forensic tool, it tries to recover/carve deleted PDF files, some of
> which can only be partially recovered.
> PDFParser throws an exception if the parsed content has no PDF version
> header, which prevents any further parsing.
> With that exception commented out, the parser still throws "Missing root
> object specification in trailer." in the initialParse method, because the
> root object is normally located at the beginning of a PDF file.
> However, with little effort I was able to build a "fake" root COSDictionary
> and populate its PAGES entry with the recoverable PAGE objects, found by
> iterating over document.getXrefTable():
> {code:java}
> protected void initialParse() throws IOException
> {
>     COSDictionary trailer = retrieveTrailer();
>
>     COSDictionary root = trailer.getCOSDictionary(COSName.ROOT);
>     if (root == null)
>     {
>         // rebuild root from xref recovered info
>         root = new COSDictionary();
>         root.setItem(COSName.TYPE, COSName.CATALOG);
>         trailer.setItem(COSName.ROOT, root);
>         // identify recovered pages from xref to mount COSName.PAGES
>         Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
>         COSArray kids = new COSArray();
>         for (Entry<COSObjectKey, Long> e : xrefTable.entrySet()) {
>             COSObject o = document.getObjectFromPool(e.getKey());
>             if (o.getObject() instanceof COSDictionary) {
>                 COSDictionary d = (COSDictionary) o.getObject();
>                 COSName type = d.getCOSName(COSName.TYPE);
>                 if (type != null) {
>                     if (type.equals(COSName.PAGE)) {
>                         kids.add(d);
>                     }
>                 }
>             }
>         }
>
>         COSDictionary pages = new COSDictionary();
>         pages.setItem(COSName.TYPE, COSName.PAGES);
>         pages.setItem(COSName.COUNT, COSInteger.get(kids.size()));
>         pages.setItem(COSName.KIDS, kids);
>         document.setDecrypted();
>         root.setItem(COSName.PAGES, pages);
>         initialParseDone = true;
>         return;
>         // throw new IOException("Missing root object specification in trailer.");
>     }
>     // in some pdfs the type value "Catalog" is missing in the root object
>     if (isLenient() && !root.containsKey(COSName.TYPE))
>     {
>         root.setItem(COSName.TYPE, COSName.CATALOG);
>     }
>     // check pages dictionaries
>     checkPages(root);
>     document.setDecrypted();
>     initialParseDone = true;
> }
> {code}
> This small change was enough to show the recoverable pages in PDFDebugger,
> export XMP metadata and text, and obtain the rendered pages as buffered
> images for use in my OCR module.
> So it would be very useful if PDFBox offered an optional, configurable mode
> for opening/recovering inconsistent or incomplete PDF files, with at least
> the implementation above or further recovery actions.
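
The missing-header case described in the report can be detected before a carved fragment is handed to the parser, so a caller can decide up front whether a recovery path is needed. The following is a minimal sketch using only the Java standard library. The class name `CarvedPdfProbe`, the method `findHeader`, and the 1024-byte search window are illustrative assumptions; none of them are part of PDFBox or of the reporter's patch.

```java
import java.nio.charset.StandardCharsets;

public class CarvedPdfProbe {
    // Marker that begins the PDF version header, e.g. "%PDF-1.7".
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of "%PDF-" if it lies entirely within the first
     * {@code limit} bytes of {@code data}, or -1 if it is absent.
     * The limit is an assumption of this sketch, not a PDFBox setting.
     */
    public static int findHeader(byte[] data, int limit) {
        int end = Math.min(data.length, limit) - HEADER.length;
        for (int i = 0; i <= end; i++) {
            boolean match = true;
            for (int j = 0; j < HEADER.length; j++) {
                if (data[i + j] != HEADER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // An intact file starts with the version header.
        byte[] intact = "%PDF-1.4\n1 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        // A carved fragment may start in the middle of the object stream.
        byte[] carved = "4 0 obj\n<< /Type /Page >>\nendobj\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(findHeader(intact, 1024)); // prints 0
        System.out.println(findHeader(carved, 1024)); // prints -1
    }
}
```

A result of -1 would indicate a fragment like the ones IPED carves, for which a lenient or recovery-oriented parse (such as the rebuilt root dictionary shown in the issue) would be the only option.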