I was wondering whether there is a configuration option that lets us skip
rendering (text extraction) of images in a PDF. In my case these would be
scanned pages.
Which PDF operators are handled, and by which classes, depends on the
PDFBox properties in use. For example, PDFTextStripper.properties does not
handle BI (begin inline image), and its handler for the 'Do' operator does
not process XObject images.
Independent of this setting, the stream data is parsed for all objects (with
the current parser).
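
To make the point above concrete, here is a minimal, hypothetical sketch in
plain Java of how a properties-style operator table can decide what gets
handled. It does not use PDFBox: OperatorDispatchSketch and its fields are
made-up names for illustration, and the real dispatch in PDFBox is more
involved. Every token is still visited, which mirrors the remark that stream
data is parsed regardless of which handlers are configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration only: none of these names are PDFBox classes.
class OperatorDispatchSketch {
    // Operator name -> handler. Operators absent from the map are skipped.
    private final Map<String, Consumer<String>> handlers = new HashMap<>();
    final List<String> extracted = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    OperatorDispatchSketch() {
        // Only a text-showing operator is registered, as in a text-only setup.
        handlers.put("Tj", extracted::add);
    }

    // Each token is {operator, operand}. All tokens are read; only the
    // handling differs between registered and unregistered operators.
    void process(List<String[]> tokens) {
        for (String[] token : tokens) {
            Consumer<String> handler = handlers.get(token[0]);
            if (handler != null) {
                handler.accept(token[1]);
            } else {
                skipped.add(token[0]);
            }
        }
    }

    public static void main(String[] args) {
        OperatorDispatchSketch sketch = new OperatorDispatchSketch();
        sketch.process(Arrays.asList(
                new String[] {"Tj", "Hello"},
                new String[] {"BI", "<inline image data>"},
                new String[] {"Tj", "world"}));
        System.out.println("extracted: " + sketch.extracted);
        System.out.println("skipped operators: " + sketch.skipped);
    }
}
```

In this toy model, skipping an operator saves the handler's work but not the
cost of reading the token, so a scanned page would still be read in full.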
Timo
On Fri, Jan 27, 2012 at 3:30 PM, Timo Boehme <[email protected]> wrote:
I am continuing this thread on the dev list so as not to clutter JIRA issue
PDFBOX-847.
Mahesh Yadav commented on PDFBOX-847:
-------------------------------------
...
We use Jackrabbit; the only difference is that we have our own custom
parser (not provided by Jackrabbit) for parsing PDFs, and we interact
with PDFBox as shown below.
PDFParser parser = new PDFParser(new BufferedInputStream(stream));
// parse() has to run before the document is requested from the parser
parser.parse();
PDDocument document = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setLineSeparator("\n");
stripper.writeText(document, writer);
I think we need to change the above approach and use "PDDocument.load" with
a RandomAccessFile.
If you set a temporary directory with parser.setTempDirectory before
calling parse(), it will automatically use a temporary file instead of a
memory buffer.
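
The trade-off between a memory buffer and a temporary file can be sketched
in plain Java as follows. This is only an illustration of the general idea,
not PDFBox's implementation: ScratchBufferSketch, buffer() and the field
names are hypothetical, and the only PDFBox call named in this thread is
parser.setTempDirectory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical illustration only: not a PDFBox class.
class ScratchBufferSketch {
    final byte[] inMemory; // non-null when the data was buffered on the heap
    final File onDisk;     // non-null when the data was spooled to a temp file

    private ScratchBufferSketch(byte[] inMemory, File onDisk) {
        this.inMemory = inMemory;
        this.onDisk = onDisk;
    }

    // With no temp directory the whole stream ends up on the heap;
    // with one, it is copied chunk by chunk into a scratch file instead.
    static ScratchBufferSketch buffer(InputStream in, File tempDir) throws IOException {
        if (tempDir == null) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            copy(in, out);
            return new ScratchBufferSketch(out.toByteArray(), null);
        }
        File scratch = File.createTempFile("sketch", ".tmp", tempDir);
        scratch.deleteOnExit();
        try (OutputStream out = Files.newOutputStream(scratch.toPath())) {
            copy(in, out);
        }
        return new ScratchBufferSketch(null, scratch);
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "pretend this is a large scanned page".getBytes(StandardCharsets.UTF_8);
        File tempDir = new File(System.getProperty("java.io.tmpdir"));
        ScratchBufferSketch spooled = buffer(new ByteArrayInputStream(data), tempDir);
        System.out.println("spooled to: " + spooled.onDisk);
        spooled.onDisk.delete();
    }
}
```

In the disk-backed path only one 8 KB chunk is held in memory at a time,
which is why a temporary file helps with large scanned documents.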
Timo
--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
[email protected]
_____________________________________________________________________
OntoChem GmbH
Geschäftsführer: Dr. Lutz Weber
Sitz: Halle / Saale
Registergericht: Stendal
Registernummer: HRB 215461
_____________________________________________________________________