Hi Team,

I have a problem extracting text from a PDF. When we extract the text, the
words merge together. Can you suggest what we should do to fix this?

I have attached the PDF file from which I am extracting the text, and I am
using the code below to extract it.

Please help me as soon as possible.

private static string GetTextByArea_Orgnal(PDDocument doc, int x, int y, int w, int h)
{
    // Configure the area-based text stripper
    PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-8");
    stripper.setLineSeparator(" ");
    stripper.setDropThreshold(3);
    stripper.setWordSeparator(" ");
    stripper.setParagraphStart("<p>");
    stripper.setParagraphEnd("</p>");
    stripper.setIndentThreshold(1);
    stripper.setSortByPosition(true);

    // Register the rectangle to extract text from
    Dimension d = new Dimension(w, h);
    Rectangle rect = new Rectangle(new Point(x, y), d);
    stripper.addRegion("class1", rect);

    java.util.List allPages = doc.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage)allPages.get(0);

    // Overlay the region with a cyan rectangle to check if I got the
    // coordinates and dimensions right
    PDPageContentStream contentStream = new PDPageContentStream(doc, firstPage, true, true);
    contentStream.setNonStrokingColor(Color.CYAN);
    contentStream.fillRect(x, y, w, h);
    contentStream.close();

    // Extract the text of the registered region from the first page
    stripper.extractRegions(firstPage);
    return stripper.getTextForRegion("class1");
}
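
A side note on the cyan overlay in the code: as far as I understand,
PDFTextStripperByArea measures its regions from the top-left corner of the
page, while PDPageContentStream.fillRect works in PDF user space, which starts
at the bottom-left. So passing the same x and y to both may not mark the same
area. Below is a minimal sketch in plain Java of the conversion I think is
needed. It assumes an unrotated page whose media box starts at (0, 0).
RegionCoords and toPdfSpace are hypothetical names of my own, not part of
PDFBox, and the page height of 792 is only an example value.

```java
// Hypothetical helper, not part of PDFBox: converts a rectangle given in
// top-left-origin coordinates (y grows downward) into bottom-left-origin
// PDF user space (y grows upward).
// Assumes an unrotated page whose media box origin is (0, 0).
final class RegionCoords {
    private RegionCoords() {}

    // Returns {x, y, w, h}, where y is the bottom edge of the rectangle
    // measured from the bottom of the page.
    static float[] toPdfSpace(float x, float yTop, float w, float h, float pageHeight) {
        float yBottom = pageHeight - yTop - h;
        return new float[] { x, yBottom, w, h };
    }

    public static void main(String[] args) {
        // Example values only: a 200 x 30 region placed 100 units below the
        // top edge of a page that is 792 units high.
        float[] r = toPdfSpace(50f, 100f, 200f, 30f, 792f);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

If this is right, the converted values would go to fillRect, and the
unconverted rectangle would still go to addRegion.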

Thanks,

Laxmi Narayan
