[ 
https://issues.apache.org/jira/browse/PDFBOX-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765590#comment-17765590
 ] 

Patrick Dalla Bernardina edited comment on PDFBOX-5683 at 9/15/23 1:29 PM:
---------------------------------------------------------------------------

Yes, working with invalid, inconsistent, and unpredictable data could lead to
endless loops. But, forensically, it would be worthwhile to recover as much
information as possible. Our tool uses timeouts to skip these kinds of
problems.
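
To illustrate the timeout approach, here is a minimal sketch in plain Java. It
is not the IPED implementation; the class name TimeLimitedTask and its run
method are hypothetical and exist only for this example. It runs a parsing
task on a worker thread and gives up after a fixed time limit.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of PDFBox, Tika, or IPED.
final class TimeLimitedTask {

    private TimeLimitedTask() {
    }

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns an empty Optional if the task times out or throws.
     */
    static <T> Optional<T> run(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Interrupts the worker; a loop that never checks the
            // interrupt flag keeps running in the background.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the parse call in the Callable and treat an empty result
as "skip this file". Note that interruption only stops code that cooperates
with it, so a truly endless loop may need a separate process to be killed
reliably.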

"The other idea to search for missing fonts, color spaces will most likely not
work.": The idea of an Object Recoverer interface that delegates the recovery
of missing objects could resolve some of them in alternate ways, depending on
the implementation. For example, if the PDF being carved comes from a
well-known collection of available PDFs, the fonts or color spaces it uses
could be inferred to be the same.
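
As a rough sketch of what such an interface could look like (all names below
are hypothetical; PDFBox has no such API, and resources are modeled as raw
bytes keyed by type and name purely to keep the example self-contained):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface: asked for a replacement when an object is missing.
interface ObjectRecoverer {
    /** Returns a substitute for the missing resource, or empty if none is known. */
    Optional<byte[]> recover(String type, String name);
}

// Hypothetical implementation backed by a known reference collection.
final class KnownCollectionRecoverer implements ObjectRecoverer {
    private final Map<String, byte[]> resources = new HashMap<>();

    /** Registers a resource taken from the reference collection. */
    void register(String type, String name, byte[] data) {
        resources.put(type + "/" + name, data.clone());
    }

    @Override
    public Optional<byte[]> recover(String type, String name) {
        byte[] data = resources.get(type + "/" + name);
        return data == null ? Optional.empty() : Optional.of(data.clone());
    }
}

// Hypothetical combinator: asks each delegate in order, first answer wins.
final class ChainedRecoverer implements ObjectRecoverer {
    private final List<ObjectRecoverer> delegates;

    ChainedRecoverer(List<ObjectRecoverer> delegates) {
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public Optional<byte[]> recover(String type, String name) {
        for (ObjectRecoverer delegate : delegates) {
            Optional<byte[]> result = delegate.recover(type, name);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

With a design along these lines, the parser would stay unaware of where a
substitute comes from, and a forensic tool could plug in a recoverer built
from its own reference collection.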


> Inconsistent/incomplete PDF rendering
> -------------------------------------
>
>                 Key: PDFBOX-5683
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5683
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Patrick Dalla Bernardina
>            Priority: Minor
>              Labels: Forensic
>         Attachments: pdf1.pdf
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> We have integrated Tika and its default parsers into a forensic tool (IPED).
> As a forensic tool, it tries to recover/carve deleted PDF files, some of
> which can only be partially recovered.
> PDFParser throws an exception if the parsed content has no PDF version
> header, which prevents any further parsing.
> With that exception commented out, the parser still throws "Missing root
> object specification in trailer" from the initialParse method, because the
> root object is normally at the beginning of a PDF file.
> However, with a small effort I could build a "fake" root COSDictionary and
> fill its PAGES entry with the recoverable pages, found by scanning
> document.getXrefTable():
> {code:java}
>     protected void initialParse() throws IOException
>     {
>         COSDictionary trailer = retrieveTrailer();
>     
>         COSDictionary root = trailer.getCOSDictionary(COSName.ROOT);
>         if (root == null)
>         {
>             // rebuild root from xref recovered info
>             root = new COSDictionary();
>             root.setItem(COSName.TYPE, COSName.CATALOG);
>             trailer.setItem(COSName.ROOT, root);
>             // identify recovered pages from xref to mount COSName.PAGES
>             Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
>             COSArray kids = new COSArray();
>             for (Entry<COSObjectKey, Long> e : xrefTable.entrySet()) {
>                 COSObject o = document.getObjectFromPool(e.getKey());
>                 if (o.getObject() instanceof COSDictionary) {
>                     COSDictionary d = (COSDictionary) o.getObject();
>                     COSName type = d.getCOSName(COSName.TYPE);
>                     if (type != null) {
>                         if (type.equals(COSName.PAGE)) {
>                             kids.add(d);
>                         }
>                     }
>                 }
>             }
>             
>             COSDictionary pages = new COSDictionary();
>             pages.setItem(COSName.TYPE, COSName.PAGES);
>             pages.setItem(COSName.COUNT, COSInteger.get(kids.size()));
>             pages.setItem(COSName.KIDS, kids);
>             document.setDecrypted();
>             root.setItem(COSName.PAGES, pages);
>             initialParseDone = true;
>             return;
>             // throw new IOException("Missing root object specification in trailer.");
>         }
>         // in some pdfs the type value "Catalog" is missing in the root object
>         if (isLenient() && !root.containsKey(COSName.TYPE))
>         {
>             root.setItem(COSName.TYPE, COSName.CATALOG);
>         }
>         // check pages dictionaries
>         checkPages(root);
>         document.setDecrypted();
>         initialParseDone = true;
>     }
> {code}
> This small change was enough to show the recoverable pages in PDFDebugger,
> export XMP metadata and text, and get the rendered pages as buffered images
> for use in my OCR module.
> So it would be very useful if PDFBox had an optional, configurable mode to
> open/recover inconsistent or incomplete PDF files, with at least the
> implementation above or further recovery actions.


