Thanks Timo, I will try your fix and let you know.

At the moment I am working on a change that skips text extraction for any page that contains images (i.e. a scanned page). The change itself is done, but I still need to validate it with some performance tests. At the least, it should keep my application from crashing when someone uploads a scanned PDF to a loaded system.

Is there a configuration option that makes PDFBox ignore images during rendering (text extraction)? In my case these would be the scanned pages.

Thanks,
Mahesh

On Fri, Jan 27, 2012 at 3:30 PM, Timo Boehme <[email protected]> wrote:

> I continue this thread on dev list in order to not clutter JIRA issue
> PDFBOX-847.
>
> Mahesh Yadav commented on PDFBOX-847:
>> -------------------------------------
>> ...
>> We use jackrabbit and only difference that we have is we have our own
>> custom parser (not provided by jackrabbit) for parsing pdf and we interact
>> with pdfbox as shown below.
>>
>> PDFParser parser = new PDFParser(new BufferedInputStream(stream));
>> parser.parse();
>> PDDocument document = parser.getPDDocument();
>> PDFTextStripper stripper = new PDFTextStripper();
>> stripper.setLineSeparator("\n");
>> stripper.writeText(document, writer);
>>
>> I think we need to change above approach and use "PDDocument.load" with
>> RandomAccessFile
>>
>
> if you set a temporary directory before parse() with
> parser.setTempDirectory
> it will automatically use temporary file instead of memory buffer.
>
>
> Timo
>
> --
>
> Timo Boehme
> OntoChem GmbH
> H.-Damerow-Str. 4
> 06120 Halle/Saale
> T: +49 345 4780474
> F: +49 345 4780471
> [email protected]
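
A rough sketch of the page-skipping idea described at the top of this message, for illustration only. It is not the actual change and does not use PDFBox: PageSource, hasImages() and extractText() are hypothetical stand-ins for however a real parser would detect images on a page and extract that page's text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a PDF page; this is NOT a PDFBox type.
interface PageSource {
    boolean hasImages();   // assumed cheap check, true for scanned pages
    String extractText();  // the potentially expensive extraction step
}

class SkipImagePages {

    // Builds a fixed in-memory page, used here only for demonstration.
    static PageSource page(final boolean hasImages, final String text) {
        return new PageSource() {
            public boolean hasImages() { return hasImages; }
            public String extractText() { return text; }
        };
    }

    // Concatenates the text of all pages that contain no images.
    // extractText() is never called on a page that reports images.
    static String extractText(List<PageSource> pages, String lineSeparator) {
        StringBuilder out = new StringBuilder();
        for (PageSource p : pages) {
            if (p.hasImages()) {
                continue; // skip scanned pages entirely
            }
            out.append(p.extractText()).append(lineSeparator);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PageSource> pages = Arrays.asList(
            page(false, "cover text"),
            page(true, "scanned page"),
            page(false, "appendix text"));
        // Only the first and third pages contribute to the output.
        System.out.print(extractText(pages, "\n"));
    }
}
```

The trade-off is that any real text on a page that also carries an image is lost, so the check may be too coarse for mixed pages.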
