Hi All,
I am working on reading the highlighted from PDF document using PDBox. I was able to read the highlighted text in single line both single and multiple words. However, I could not read the highlighted text across the lines. Please find the following sample code to read the highlighted text.

<code>
PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
        List allPages = pddDocument.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
            int pageNum = i + 1;
            PDPage page = (PDPage) allPages.get(i);
            List<PDAnnotation> la = page.getAnnotations();
            if (la.size() < 1) {
                continue;
            }
            System.out.println("Page number : "+pageNum);
            for (PDAnnotation pdfAnnot: la) {
                if (pdfAnnot.getSubtype().equals("Popup")) {
                    continue;
                }

PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);

                PDRectangle rect = pdfAnnot.getRectangle();
                float x = rect.getLowerLeftX() - 1;
                float y = rect.getUpperRightY() - 1;
                float width = rect.getWidth();
                float height = rect.getHeight() + rect.getHeight() / 4;

                int rotation = page.findRotation();
                if (rotation == 0) {
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                }

Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion(Integer.toString(0), awtRect);
                stripper.extractRegions(page);
System.out.println("------------------------------------------------------------------");
System.out.println("Annot type = " + pdfAnnot.getSubtype()); System.out.println("Getting text from region = " + stripper.getTextForRegion(Integer.toString(0)) + "\n"); System.out.println("Getting text from comment = " + pdfAnnot.getContents());

            }
        }

</code>

While reading the highlighted text across the lines, "pdfAnnot.getRectangle()" function returns the minimum rectangle area around the text. This gives more text than required. I could not find any API to extract the exact highlighted text.

Any help will be highly appreciated.
- Thanks
CM Reddy

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to