There is no standard for what a scanned or image-only PDF looks like. It
depends on the application that created the PDF. First, you might
still have some text, e.g. page numbers or an additional
text layer produced by OCR. Second, not all applications produce a
single image; some produce multiple images, one for each area where
content is detected, so you might have tens or hundreds of small image snippets.
Thus, if your PDFs come from one source only, you can optimize detection for
that specific kind, but for the general case you will need some
heuristics using character count/position to decide whether a PDF is
scanned. Simply checking for page-size images will fail, e.g. for journals
with full-color pages.
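
To make the idea concrete, here is a minimal sketch of such a heuristic in plain Java. It is not PDFBox API: the class ScannedPageHeuristic, its Box type, and the thresholds are all hypothetical, and collecting the inputs (image placements and the number of visible, non-OCR characters) from the page is left out. It estimates how much of the page is covered by the union of all images, so that many small snippets count the same as one big picture, and it treats the page as scanned only if almost no visible text is present.

```java
// Hypothetical helper, not part of PDFBox. Names and thresholds are
// illustrative assumptions and would need tuning on real documents.
final class ScannedPageHeuristic {

    /** Axis-aligned rectangle in page coordinates. */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    static final int GRID = 50;                     // sampling resolution per axis
    static final double MIN_IMAGE_COVERAGE = 0.85;  // assumed threshold
    static final int MAX_VISIBLE_CHARS = 10;        // tolerates e.g. a page number

    /**
     * Fraction of the page covered by the union of all image boxes,
     * estimated by testing the centre of each cell of a GRID x GRID raster.
     * Overlapping images are therefore not counted twice.
     */
    static double imageCoverage(Box page, Box[] images) {
        int covered = 0;
        for (int row = 0; row < GRID; row++) {
            for (int col = 0; col < GRID; col++) {
                double cx = page.x + (col + 0.5) * page.width / GRID;
                double cy = page.y + (row + 0.5) * page.height / GRID;
                for (Box image : images) {
                    if (image.contains(cx, cy)) {
                        covered++;
                        break;
                    }
                }
            }
        }
        return (double) covered / (GRID * GRID);
    }

    /**
     * visibleChars should exclude invisible OCR text; how that count is
     * obtained from the PDF is outside the scope of this sketch.
     */
    static boolean looksScanned(Box page, Box[] images, int visibleChars) {
        return visibleChars <= MAX_VISIBLE_CHARS
                && imageCoverage(page, images) >= MIN_IMAGE_COVERAGE;
    }
}
```

With this rule, a page consisting of a page-filling background image plus several hundred visible characters is not classified as scanned, while a page tiled with image snippets and carrying only a page number is. Character positions could refine this further, for example by ignoring text that lies only in the page margins.
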
PS: Please post this kind of question to the user mailing list instead
of the developer list.
Best,
Timo
On 17.08.2015 at 09:07, Manfred Pock wrote:
The PDFBox version is the 2.0 trunk version.
For performance reasons, we render PDFs that have one picture over the
whole page (scanned PDFs) on our own (about 2 sec faster).
The other PDFs we render with PDFBox. We check different
attributes from the page resources (ShadingNames, ExtGSNames,
PatternNames, PropertiesNames, ColorSpaceNames) and the count and size of
the pictures (larger than the MediaBox). But we don't check the
font names from the resources because we have OCR (invisible) text on the
PDF page to allow searching within the page.
Now we have a PDF with a page-filling background image and some
text overlaid. We detect this page as a scanned page, so we render
only the picture.
Is there a better way to detect that a PDF page is a
scanned page with no additional text?
Regards, Manfred
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany
phone: +49 345 478 047 4 | fax: +49 345 478 047 1
email: [email protected] | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber