[ 
https://issues.apache.org/jira/browse/PDFBOX-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765283#comment-17765283
 ] 

Andreas Lehmkühler commented on PDFBOX-5683:
--------------------------------------------

PDFBox already has a couple of self-healing mechanisms; finding the root 
dictionary if the trailer is damaged is one of them. But in your case it isn't 
triggered, as the version info is mandatory for PDFBox. We may change that, but 
we have to keep in mind that PDFBox won't detect non-PDF documents anymore, so 
at least theoretically we have to deal with any kind of file. We may also 
have to harden the parser so that it won't run into bigger trouble such as 
endless loops or extremely long-running attempts to open certain files. On 
the other hand, we may consider adding some sort of switch to optionally 
activate that specific feature.
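
Just to illustrate what such a switch could look like, here is a rough sketch. 
The class HeaderCheck, the requireHeader flag and the MAX_SCAN limit are 
hypothetical names and not part of PDFBox. The sketch searches a bounded prefix 
of the input for the "%PDF-" marker and then either rejects the input or goes on 
without version info, depending on the flag.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not PDFBox code: a header check that can be relaxed.
public class HeaderCheck {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumed limit: only a bounded prefix is scanned, so arbitrary input
    // cannot cause a long-running search.
    private static final int MAX_SCAN = 1024;

    /**
     * Returns the version following "%PDF-" (for example "1.7"), or null if
     * no header was found and requireHeader is false.
     */
    public static String findVersion(byte[] data, boolean requireHeader) throws IOException {
        int limit = Math.min(data.length, MAX_SCAN);
        for (int i = 0; i + MARKER.length <= limit; i++) {
            if (startsWithMarker(data, i)) {
                return readVersion(data, i + MARKER.length);
            }
        }
        if (requireHeader) {
            throw new IOException("Missing PDF version header");
        }
        return null; // lenient mode: go on without version info
    }

    private static boolean startsWithMarker(byte[] data, int offset) {
        for (int j = 0; j < MARKER.length; j++) {
            if (data[offset + j] != MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    private static String readVersion(byte[] data, int start) {
        int end = start;
        // Collect digits and dots directly after the marker.
        while (end < data.length
                && (data[end] == '.' || (data[end] >= '0' && data[end] <= '9'))) {
            end++;
        }
        return new String(data, start, end - start, StandardCharsets.US_ASCII);
    }
}
```

With requireHeader set to true this behaves like a strict check, with false a 
carved fragment without a header would be passed on to the rest of the parser.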

However, after changing the exception to a simple debug log, PDFBox is able to 
find the root object but struggles with some page issue. I have to dig deeper 
into that.

The other idea, to search for missing fonts and color spaces, will most likely 
not work. Either the resources dictionary, including the name mapping, is 
damaged, or the object itself is damaged. In the first case we won't find the 
object as the mapping is missing, and in the second the mapping is intact but 
the object is missing. The object itself doesn't know the name which is used 
to address it.
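
The two cases can be pictured with a very reduced model. This is plain Java 
and not PDFBox code; ResourceLookupModel and its two maps are made up for the 
illustration and only stand in for a resources dictionary and the object pool.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model, not PDFBox code: resource names live only in the
// resources dictionary, never in the referenced object itself.
public class ResourceLookupModel {
    // Resource name (e.g. "F1") -> object number
    private final Map<String, Integer> resourceNames = new HashMap<>();
    // Object number -> object content (stands in for a font dictionary)
    private final Map<Integer, String> objects = new HashMap<>();

    public void addMapping(String name, int objectNumber) {
        resourceNames.put(name, objectNumber);
    }

    public void addObject(int objectNumber, String content) {
        objects.put(objectNumber, content);
    }

    /** Returns the object for a resource name, or null if it cannot be resolved. */
    public String resolve(String name) {
        Integer objectNumber = resourceNames.get(name);
        if (objectNumber == null) {
            // Case 1: the mapping is damaged. The object may still exist, but
            // nothing in it says that it was addressed by this name.
            return null;
        }
        // Case 2: the mapping is intact, but the object may be missing,
        // in which case the lookup yields null.
        return objects.get(objectNumber);
    }
}
```

In the first case a search over all objects might still find some font, but 
there is no way to tell whether it is the one the content stream refers to.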



> Inconsistent/incomplete PDF rendering
> -------------------------------------
>
>                 Key: PDFBOX-5683
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5683
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Patrick Dalla Bernardina
>            Priority: Minor
>              Labels: Forensic
>         Attachments: pdf1.pdf
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> We have integrated Tika and its default parsers in a forensic tool (IPED). As 
> a forensic tool, it tries to recover/carve deleted PDF files, some of which 
> can be partially recovered.
> PDFParser throws an exception if there is no PDF version header in the parsed 
> content, preventing any further content parsing.
> With this exception commented out, the parser still throws the exception 
> "Missing root object specification in trailer." in the initialParse method, 
> as this root object is normally at the beginning of a PDF file.
> However, with some simple effort I could build a "fake" root 
> COSDictionary and build the PAGES entry from the recoverable PAGEs, searching 
> for them via document.getXrefTable():
> {code:java}
>     protected void initialParse() throws IOException
>     {
>         COSDictionary trailer = retrieveTrailer();
>     
>         COSDictionary root = trailer.getCOSDictionary(COSName.ROOT);
>         if (root == null)
>         {
>             // rebuild root from xref recovered info
>             root = new COSDictionary();
>             root.setItem(COSName.TYPE, COSName.CATALOG);
>             trailer.setItem(COSName.ROOT, root);
>             // identify recovered pages from xref to mount COSName.PAGES
>             Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
>             COSArray kids = new COSArray();
>             for (Entry<COSObjectKey, Long> e : xrefTable.entrySet()) {
>                 COSObject o = document.getObjectFromPool(e.getKey());
>                 if (o.getObject() instanceof COSDictionary) {
>                     COSDictionary d = (COSDictionary) o.getObject();
>                     COSName type = d.getCOSName(COSName.TYPE);
>                     if (type != null) {
>                         if (type.equals(COSName.PAGE)) {
>                             kids.add(d);
>                         }
>                     }
>                 }
>             }
>             
>             COSDictionary pages = new COSDictionary();
>             pages.setItem(COSName.TYPE, COSName.PAGES);
>             pages.setItem(COSName.COUNT, COSInteger.get(kids.size()));
>             pages.setItem(COSName.KIDS, kids);
>             document.setDecrypted();
>             root.setItem(COSName.PAGES, pages);
>             initialParseDone = true;
>             return;
>             // throw new IOException("Missing root object specification in trailer.");
>         }
>         // in some pdfs the type value "Catalog" is missing in the root object
>         if (isLenient() && !root.containsKey(COSName.TYPE))
>         {
>             root.setItem(COSName.TYPE, COSName.CATALOG);
>         }
>         // check pages dictionaries
>         checkPages(root);
>         document.setDecrypted();
>         initialParseDone = true;
>     }
> {code}
> This simple effort was enough to show the recoverable pages in PDFDebugger, 
> export XMP metadata and text, and get the rendered pages as buffered images 
> to use in my OCR module.
> So it would be very useful if PDFBox had some optional, parameterizable 
> mode to open/recover inconsistent/incomplete PDF files, with at least the 
> implementation above or further recovery actions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
