Re: FW: Word Merging Problem
I tried running your code but couldn't: it was written for an older version of PDFBox (probably 1.8), it has a syntax error, and the parameters are missing, so I doubt your code ever ran that way.

I tried running ExtractText on PDFBox 1.8 and yes, many blanks are missing. So please use the current version, 2.0.8. I found one occurrence where the blank was missing ("Wewould"), but Adobe Reader has the same problem.

Tilman

On 25.01.2018 at 04:22, Laxmi Narayan wrote:
[...]

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
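The "Wewould" case shows up in two independent readers, which is consistent with a common cause of merged words: a PDF often positions glyphs without storing explicit space characters, so an extractor has to infer blanks from the horizontal gaps between glyphs. The sketch below is only an illustration of that idea, not the algorithm PDFBox or Adobe Reader actually uses. The `Glyph` class, the `layout` helper, the fixed glyph width, and the `0.5` tolerance are all hypothetical values chosen for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacingSketch {

    /** Hypothetical glyph: one character with its left edge and width in page units. */
    public static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of one line. A word separator is inserted only when the
     * gap to the previous glyph is larger than tolerance * previous glyph width.
     */
    public static String joinGlyphs(List<Glyph> glyphs, double tolerance, String wordSeparator) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.width) {
                    sb.append(wordSeparator);
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    /**
     * Simulates a page that stores no space glyphs: each space in the text
     * only moves the pen forward by spaceGap, every other character becomes
     * a glyph of fixed width.
     */
    public static List<Glyph> layout(String text, double glyphWidth, double spaceGap) {
        List<Glyph> out = new ArrayList<>();
        double x = 0;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                x += spaceGap;
                continue;
            }
            out.add(new Glyph(c, x, glyphWidth));
            x += glyphWidth;
        }
        return out;
    }

    public static void main(String[] args) {
        // Gap of 3.0 exceeds 0.5 * 5.0, so a blank is inserted.
        System.out.println(joinGlyphs(layout("We would", 5.0, 3.0), 0.5, " "));
        // Gap of 1.0 is below the threshold, so the words merge.
        System.out.println(joinGlyphs(layout("We would", 5.0, 1.0), 0.5, " "));
    }
}
```

With these made-up numbers the first call yields "We would" and the second "Wewould". When the gap in the file itself is very tight, any gap-based reader may drop the blank regardless of library version.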
Re: FW: Word Merging Problem
Hi,

Please upload your file to a sharehoster; PDF files don't go through the list. Please also tell us which PDFBox version you're using (hopefully 2.0.8), and please post to the user mailing list, not the dev mailing list.

I was able to access your file because your post was stuck in moderation. I don't have the time to try your code now (will do tonight). I tried with the ExtractText command line utility, and that one did have blanks.

Tilman

On 25.01.2018 at 04:22, Laxmi Narayan wrote:
[...]
FW: Word Merging Problem
Hi Team,

I have a problem extracting text from a PDF. When we extract the text, words merge together. Can you suggest what we should do about this? I have attached the PDF file from which I am extracting the text, and I am using the code below to extract it. Please help me as soon as possible.

    private static string GetTextByArea_Orgnal(PDDocument doc, int x, int y, int w, int h)
    {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-8");
        stripper.setLineSeparator(" ");
        stripper.setDropThreshold(3);
        stripper.setWordSeparator(" ");
        stripper.setParagraphStart("");
        stripper.setParagraphEnd("");
        stripper.setIndentThreshold(1);
        stripper.setSortByPosition(true);
        //==
        //==
        Dimension d = new Dimension(w, h);
        Rectangle rect = new Rectangle(new Point(x, y), d);
        stripper.addRegion("class1", rect);
        java.util.List allPages = doc.getDocumentCatalog().getAllPages();
        PDPage firstPage = (PDPage)allPages.get(0);
        // overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
        PDPageContentStream contentStream = new PDPageContentStream(doc, firstPage, true, true);
        contentStream.setNonStrokingColor(Color.CYAN);
        contentStream.fillRect(x, y, w, h);
        contentStream.close();
        = stripper.extractRegions(firstPage);
        return stripper.getTextForRegion("class1");
    }

Thanks,
Laxmi Narayan