Patrick Dalla Bernardina created PDFBOX-5683:
------------------------------------------------
Summary: Inconsistent/incomplete PDF rendering
Key: PDFBOX-5683
URL: https://issues.apache.org/jira/browse/PDFBOX-5683
Project: PDFBox
Issue Type: New Feature
Components: Parsing
Reporter: Patrick Dalla Bernardina
We have integrated Tika and its default parsers into a forensic tool (IPED). As
a forensic tool, it tries to recover (carve) deleted PDF files, some of which
can only be partially recovered.
PDFParser throws an exception if the parsed content has no PDF version header,
which prevents any further parsing of the content.
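For illustration, a carving tool could test a fragment for the header up front and decide whether to take a recovery path. The following is only a minimal sketch using plain Java, not PDFBox code: the class name PdfHeaderProbe, the method findHeaderOffset and the 1024-byte search window are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

final class PdfHeaderProbe {
    // The version header of a PDF file starts with these bytes, e.g. "%PDF-1.7".
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of "%PDF-" if the whole marker lies within the first
     * `limit` bytes of `data`, or -1 if it is absent (hypothetical helper).
     */
    static int findHeaderOffset(byte[] data, int limit) {
        int end = Math.min(data.length, limit) - MARKER.length;
        for (int i = 0; i <= end; i++) {
            int j = 0;
            while (j < MARKER.length && data[i + j] == MARKER[j]) {
                j++;
            }
            if (j == MARKER.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] intact = "%PDF-1.7\n1 0 obj".getBytes(StandardCharsets.US_ASCII);
        byte[] carved = "4 0 obj\n<< /Type /Page >>\nendobj".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findHeaderOffset(intact, 1024)); // header present
        System.out.println(findHeaderOffset(carved, 1024)); // header missing
    }
}
```

A fragment for which this returns -1 is the case described above: it may still contain usable objects, but the parser rejects it before looking at them.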
With this exception commented out, the parser still throws the exception
"Missing root object specification in trailer" in the initialParse method, as
this root object is normally at the beginning of a PDF file.
However, with some simple effort I was able to build a "fake" root
COSDictionary and populate its PAGES entry with the recoverable pages, found by
searching document.getXrefTable():
{code:java}
protected void initialParse() throws IOException
{
    COSDictionary trailer = retrieveTrailer();
    COSDictionary root = trailer.getCOSDictionary(COSName.ROOT);
    if (root == null)
    {
        // rebuild root from xref recovered info
        root = new COSDictionary();
        root.setItem(COSName.TYPE, COSName.CATALOG);
        trailer.setItem(COSName.ROOT, root);
        // identify recovered pages from xref to build the COSName.PAGES entry
        Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
        COSArray kids = new COSArray();
        for (Entry<COSObjectKey, Long> e : xrefTable.entrySet())
        {
            COSObject o = document.getObjectFromPool(e.getKey());
            if (o.getObject() instanceof COSDictionary)
            {
                COSDictionary d = (COSDictionary) o.getObject();
                // keep only dictionaries explicitly typed as /Page
                if (COSName.PAGE.equals(d.getCOSName(COSName.TYPE)))
                {
                    kids.add(d);
                }
            }
        }
        COSDictionary pages = new COSDictionary();
        pages.setItem(COSName.TYPE, COSName.PAGES);
        pages.setItem(COSName.COUNT, COSInteger.get(kids.size()));
        pages.setItem(COSName.KIDS, kids);
        document.setDecrypted();
        root.setItem(COSName.PAGES, pages);
        initialParseDone = true;
        return;
        // throw new IOException("Missing root object specification in trailer.");
    }
    // in some pdfs the type value "Catalog" is missing in the root object
    if (isLenient() && !root.containsKey(COSName.TYPE))
    {
        root.setItem(COSName.TYPE, COSName.CATALOG);
    }
    // check pages dictionaries
    checkPages(root);
    document.setDecrypted();
    initialParseDone = true;
}
{code}
This simple change was enough to show the recoverable pages in PDFDebugger,
export XMP metadata and text, and get the rendered pages as buffered images to
use in my OCR module.
So, it would be very useful if PDFBox offered an optional, configurable mode
to open/recover inconsistent or incomplete PDF files, with at least the
implementation above or further recovery actions.
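As one example of a further recovery action, a recovery mode could locate objects in a fragment even when no usable xref table survives, by scanning the raw bytes for object headers of the form "<number> <generation> obj". The following is only a rough sketch in plain Java under that assumption, not PDFBox code: the class ObjectScanSketch, the method scan and the string key format are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObjectScanSketch {
    // Matches an object header such as "12 0 obj"; "endobj" does not match
    // because it is not preceded by two whitespace-separated numbers.
    private static final Pattern OBJ_HEADER = Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    /**
     * Maps "objectNumber generation" to the byte offset of the object header.
     * If the same key occurs more than once, the last occurrence wins.
     */
    static Map<String, Long> scan(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so the character
        // offsets reported by the matcher are also byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String fragment = "4 0 obj\n<< /Type /Page >>\nendobj\n12 0 obj\n<< >>\nendobj\n";
        Map<String, Long> found = scan(fragment.getBytes(StandardCharsets.ISO_8859_1));
        found.forEach((key, offset) -> System.out.println(key + " @ " + offset));
    }
}
```

The offsets found this way could then feed the same page collection step as in the patch above. A plain regular-expression scan can also match text that only looks like an object header inside a stream, so a real implementation would need to validate each candidate.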