Thanks Timo.

I have implemented code changes that skip parsing (text extraction) of scanned PDFs. This is working fine, and CPU utilization is far lower than before the change because scanned PDFs are no longer parsed.

I have also switched to RandomAccessFile for large PDF documents. GC behaviour improves somewhat compared to RandomAccessBuffer, at least for a single document: less garbage collection activity is seen because heap space remains available. I will still validate this with performance tests on large PDFs in a multi-threaded environment.

Thanks
Mahesh

On Mon, Jan 30, 2012 at 2:12 PM, Timo Boehme <[email protected]> wrote:

>> I was wondering if we have some configuration by which we can ignore
>> rendering (text extraction) of images in pdf, in my case this would be
>> scanned pages?
>
> Depending on PDFBox properties PDF operators are handled by specified
> classes or not, e.g. PDFTextStripper.properties does not handle BI (begin
> image) and 'Do' operator does not handle xobject images. Independent of
> this setting stream data is parsed for all objects (with current parser).
>
> Timo
>
>> On Fri, Jan 27, 2012 at 3:30 PM, Timo Boehme <[email protected]> wrote:
>>
>>> I continue this thread on dev list in order to not clutter JIRA issue
>>> PDFBOX-847.
>>>
>>> Mahesh Yadav commented on PDFBOX-847:
>>>> -------------------------------------
>>>>
>>>> ...
>>>> We use jackrabbit and only difference that we have is we have our own
>>>> custom parser (not provided by jackrabbit) for parsing pdf and we
>>>> interact with pdfbox as shown below.
>>>>
>>>> PDFParser parser = new PDFParser(new BufferedInputStream(stream));
>>>> PDDocument document = parser.getPDDocument();
>>>> parser.parse();
>>>> PDFTextStripper stripper = new PDFTextStripper();
>>>> stripper.setLineSeparator("\n");
>>>>
>>>> stripper.writeText(document, writer)
>>>>
>>>> I think we need to change above approach and use " PDDocument.load" with
>>>> RandomAccessFile
>>>
>>> if you set a temporary directory before parse() with
>>> parser.setTempDirectory
>>> it will automatically use temporary file instead of memory buffer.
>>>
>>> Timo
>>>
>>> --
>>>
>>> Timo Boehme
>>> OntoChem GmbH
>>> H.-Damerow-Str. 4
>>> 06120 Halle/Saale
>>> T: +49 345 4780474
>>> F: +49 345 4780471
>>> [email protected]
>>>
>>> _____________________________________________________________________
>>>
>>> OntoChem GmbH
>>> Managing Director: Dr. Lutz Weber
>>> Registered office: Halle / Saale
>>> Register court: Stendal
>>> Registration number: HRB 215461
>>> _____________________________________________________________________
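
P.S. To illustrate why a file-backed buffer eases GC pressure compared to an in-memory one, here is a minimal sketch using only java.io.RandomAccessFile from the JDK. It is not PDFBox code and not how PDFBox implements its scratch file; the class name FileScratchBuffer and its methods are hypothetical, made up for this example. The point is only that the appended bytes live in a temporary file rather than on the Java heap, so the heap holds just the small chunk currently being read.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical helper (not part of PDFBox): scratch storage kept in a temp file. */
final class FileScratchBuffer implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;

    private FileScratchBuffer(File file, RandomAccessFile raf) {
        this.file = file;
        this.raf = raf;
    }

    /** Creates the scratch file in tempDir, or in the default temp dir if tempDir is null. */
    static FileScratchBuffer open(File tempDir) {
        try {
            File f = File.createTempFile("scratch", ".bin", tempDir);
            f.deleteOnExit(); // fallback cleanup if close() is never called
            return new FileScratchBuffer(f, new RandomAccessFile(f, "rw"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends data at the end of the file and returns the offset it was written at. */
    long append(byte[] data) {
        try {
            long offset = raf.length();
            raf.seek(offset);
            raf.write(data);
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads exactly length bytes starting at offset; only this chunk is held on the heap. */
    byte[] read(long offset, int length) {
        try {
            byte[] out = new byte[length];
            raf.seek(offset);
            raf.readFully(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Total number of bytes stored in the scratch file. */
    long length() {
        try {
            return raf.length();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the file handle and deletes the temp file. */
    @Override
    public void close() {
        try {
            raf.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            file.delete();
        }
    }
}
```

Usage would be along the lines of opening the buffer with FileScratchBuffer.open(null), calling append() for each decoded chunk, and read() when a chunk is needed again. The trade-off is disk I/O in exchange for heap space, which is why I still want to measure it under multi-threaded load.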
