I was wondering whether there is a configuration option that lets us skip
rendering (text extraction) of images in a PDF. In my case these would be
scanned pages.

Depending on the PDFBox properties, PDF operators are either handled by the specified classes or not. For example, PDFTextStripper.properties defines no handler for BI (begin inline image), and the handler for the 'Do' operator does not process XObject images. Independent of this setting, the stream data is parsed for all objects (with the current parser).
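
To illustrate the idea, here is a minimal sketch in plain Java (no PDFBox classes) of a properties-driven operator table. The class name OperatorDispatchSketch, the handler names, and the CONFIG contents are made up for illustration and are not the actual contents of PDFTextStripper.properties. Every operator in the list is still visited, but only those with a configured handler trigger any work:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class OperatorDispatchSketch {

    // Hypothetical operator-to-handler mapping; BI and Do are deliberately absent.
    static final String CONFIG =
        "BT=BeginTextHandler\n" +
        "Tj=ShowTextHandler\n" +
        "ET=EndTextHandler\n";

    // Parse the mapping text into a Properties object.
    static Properties loadConfig(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Visit every operator; return the handler names that would be invoked, in order.
    // Operators without a configured handler are skipped silently.
    static List<String> dispatch(List<String> operators, Properties handlers) {
        List<String> invoked = new ArrayList<>();
        for (String op : operators) {
            String handler = handlers.getProperty(op);
            if (handler != null && !handler.isEmpty()) {
                invoked.add(handler);
            }
        }
        return invoked;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("BT", "Tj", "ET", "BI", "Do");
        System.out.println(dispatch(ops, loadConfig(CONFIG)));
    }
}
```

In this sketch the image-related operators cost a lookup but no handler runs for them, which mirrors the point that ignoring an operator does not avoid parsing the stream it appears in.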


Timo

On Fri, Jan 27, 2012 at 3:30 PM, Timo Boehme <[email protected]> wrote:

I continue this thread on dev list in order to not clutter JIRA issue
PDFBOX-847.

  Mahesh Yadav commented on PDFBOX-847:
-------------------------------------
...
We use Jackrabbit; the only difference is that we have our own custom
parser (not provided by Jackrabbit) for parsing PDFs, and we interact
with PDFBox as shown below.

PDFParser parser = new PDFParser(new BufferedInputStream(stream));
// parse() has to run before the document is requested from the parser
parser.parse();
PDDocument document = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setLineSeparator("\n");
stripper.writeText(document, writer);

I think we need to change the above approach and use "PDDocument.load" with
a RandomAccessFile.


If you set a temporary directory with parser.setTempDirectory before
calling parse(), the parser will automatically use a temporary file
instead of a memory buffer.


Timo

--

  Timo Boehme
  OntoChem GmbH
  H.-Damerow-Str. 4
  06120 Halle/Saale
  T: +49 345 4780474
  F: +49 345 4780471
  [email protected]

_____________________________________________________________________

  OntoChem GmbH
  Geschäftsführer: Dr. Lutz Weber
  Sitz: Halle / Saale
  Registergericht: Stendal
  Registernummer: HRB 215461
_____________________________________________________________________




