[ 
https://issues.apache.org/jira/browse/PDFBOX-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772808#comment-17772808
 ] 

Andreas Lehmkühler commented on PDFBOX-5683:
--------------------------------------------

Sorry for the late answer.

I'm afraid your approach of replacing missing objects with "well-known ones" 
won't work in most cases.

Let me try to explain why.

As I already wrote, PDFBox already has a lot of built-in self-healing 
mechanisms. If some indirect object references can't be resolved, a brute-force 
search is triggered that tries to rebuild the whole pdf. It does quite a good 
job, and I'm happy to improve it if you have a specific case including a sample 
pdf.
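
To give a rough idea of what such a brute-force search does, here is a minimal 
sketch. It is not the actual PDFBox code; the class and method names are made 
up for illustration, and it only looks for "<number> <generation> obj" markers 
in the raw bytes and records where each one starts.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class ObjectScanSketch {
    // Matches "<object number> <generation> obj"; "endobj" has no digits, so it is skipped.
    private static final Pattern OBJ_MARKER = Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns "number generation" -> byte offset of the marker.
    static Map<String, Integer> scan(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so match positions are byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_MARKER.matcher(text);
        while (m.find()) {
            // A later definition of the same object replaces an earlier one.
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }
}
```

A scan like this can only rediscover objects whose bytes are still present in 
the file. It cannot bring back anything that was never recovered.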

You are writing about missing resources and your plan to replace them using 
the resources from similar pdfs as a source. I'm not sure how you want to 
achieve that. 

All resources are named using typical abbreviations like F1, F2 for fonts, Im0, 
Im1 for images, cs1, cs2 for colorspaces and so on. There is no guarantee that 
identical names reference the same resource if you are comparing two pdfs. 
Even within a pdf those names don't guarantee anything if you compare the 
resources of two different pages. The font F1 on page 1 isn't necessarily the 
same font as font F1 on page 2.
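
As a toy illustration of that point, the following sketch models two resource 
dictionaries as plain maps from name to font. It does not use the PDFBox API; 
the class, the method and the font names are assumptions made up for the 
example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: a resources dictionary is just "name -> resource" per page.
class ResourceNameSketch {
    // Returns the names that exist in both dictionaries but point to different resources.
    static Set<String> conflictingNames(Map<String, String> first, Map<String, String> second) {
        Set<String> conflicts = new TreeSet<>();
        for (Map.Entry<String, String> entry : first.entrySet()) {
            String other = second.get(entry.getKey());
            if (other != null && !other.equals(entry.getValue())) {
                conflicts.add(entry.getKey());
            }
        }
        return conflicts;
    }
}
```

With page 1 mapping F1 to "Helvetica" and page 2 mapping F1 to "Times-Roman", 
the name F1 is reported as a conflict, although both pages use the same name.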

If your resources dictionary is intact and you are able to map the name to a 
COSObject, you face the next challenge. A COSObject represents an indirect 
object. Its main value is the object number, which is used to dereference the 
COS object that the COSObject represents. The parser uses the object number to 
get the offset within the pdf from the xref table and tries to read the object. 
If that fails, there won't be anything but the COSObject itself. You end up 
with nothing but an object number. Those numbers are assigned arbitrarily by 
the software that created the pdf, so you can't expect to get the same object 
numbers within different pdfs, even if they are very similar.
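
The lookup chain can be sketched with a toy model like the one below. This is 
not how PDFBox is implemented; the maps stand in for the xref table and the 
file body, and all names and values are assumptions for illustration only.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical model of dereferencing an indirect object.
class DereferenceSketch {
    // xref: object number -> byte offset; body: byte offset -> object stored there.
    static Optional<String> resolve(Map<Integer, Long> xref, Map<Long, String> body, int objectNumber) {
        Long offset = xref.get(objectNumber);
        if (offset == null) {
            // No xref entry: only the bare object number is left.
            return Optional.empty();
        }
        // Empty if the bytes at that offset were not recovered.
        return Optional.ofNullable(body.get(offset));
    }
}
```

If object 12 is lost in the damaged pdf, looking up object 12 in a similar pdf 
may well return something, but it can be a completely unrelated object, for 
example an image instead of the font you were looking for.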

Even if you are magically able to replace a missing object, when it comes to 
fonts you may encounter the next issue: font subsets. Those are used to reduce 
the size of embedded fonts by limiting a font to the characters which are used 
within the pdf/page. You can't use the replacement if your pdf uses some 
character which isn't used in the other pdf/page. And maybe there are 
additional issues with arbitrary glyph encodings.
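
A simplified way to picture the subset problem is shown below. The sketch 
treats a subset font as a plain set of characters, which ignores glyph 
encodings entirely; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical check: which characters of a text are not covered by a subset font?
class SubsetCoverageSketch {
    static Set<Character> missingGlyphs(String text, Set<Character> subsetGlyphs) {
        Set<Character> missing = new TreeSet<>();
        for (char c : text.toCharArray()) {
            if (!subsetGlyphs.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }
}
```

A subset built for the text "Hello" contains only H, e, l and o. Used as a 
replacement on a page that shows "Hold", it has no glyph for d.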

Maybe I didn't get your point right, but I'm afraid there isn't much to improve 
at that level. But please prove me wrong, and I'll happily add any reasonable 
improvement to our parser.



> Inconsistent/incomplete PDF rendering
> -------------------------------------
>
>                 Key: PDFBOX-5683
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5683
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Patrick Dalla Bernardina
>            Priority: Minor
>              Labels: Forensic
>         Attachments: pdf1.pdf
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> We have integrated Tika and its default parsers in a forensic tool (IPED). As 
> a forensic tool, it tries to recover/carve deleted PDF files, some of which 
> can only be partially recovered.
> PDFParser throws an exception if there is no PDF version header in the parsed 
> content, which prevents any further parsing of the content.
> With this exception commented out, the parser still throws the exception 
> "Missing root object specification in trailer." in the initialParse method, 
> as this root object is normally at the beginning of a PDF file.
> However, with a small effort I could build a "fake" root COSDictionary and 
> fill its PAGES entry with the recoverable pages, found by searching 
> document.getXrefTable():
> {code:java}
>     protected void initialParse() throws IOException
>     {
>         COSDictionary trailer = retrieveTrailer();
>     
>         COSDictionary root = trailer.getCOSDictionary(COSName.ROOT);
>         if (root == null)
>         {
>             // rebuild root from xref recovered info
>             root = new COSDictionary();
>             root.setItem(COSName.TYPE, COSName.CATALOG);
>             trailer.setItem(COSName.ROOT, root);
>             // identify recovered pages from xref to mount COSName.PAGES
>             Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
>             COSArray kids = new COSArray();
>             for (Entry<COSObjectKey, Long> e : xrefTable.entrySet()) {
>                 COSObject o = document.getObjectFromPool(e.getKey());
>                 if (o.getObject() instanceof COSDictionary) {
>                     COSDictionary d = (COSDictionary) o.getObject();
>                     COSName type = d.getCOSName(COSName.TYPE);
>                     if (type != null) {
>                         if (type.equals(COSName.PAGE)) {
>                             kids.add(d);
>                         }
>                     }
>                 }
>             }
>             
>             COSDictionary pages = new COSDictionary();
>             pages.setItem(COSName.TYPE, COSName.PAGES);
>             pages.setItem(COSName.COUNT, COSInteger.get(kids.size()));
>             pages.setItem(COSName.KIDS, kids);
>             document.setDecrypted();
>             root.setItem(COSName.PAGES, pages);
>             initialParseDone = true;
>             return;
>             // throw new IOException("Missing root object specification in trailer.");
>         }
>         // in some pdfs the type value "Catalog" is missing in the root object
>         if (isLenient() && !root.containsKey(COSName.TYPE))
>         {
>             root.setItem(COSName.TYPE, COSName.CATALOG);
>         }
>         // check pages dictionaries
>         checkPages(root);
>         document.setDecrypted();
>         initialParseDone = true;
>     }
> {code}
> This simple effort was enough to show the recoverable pages in PDFDebugger, 
> export XMP metadata and text, and get the rendered pages as buffered images 
> to use in my OCR module.
> So, it would be very useful if PDFBox had an optional, configurable mode to 
> open/recover inconsistent/incomplete pdf files, with at least the 
> implementation above or further recovery actions.



