[ https://issues.apache.org/jira/browse/PDFBOX-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832306#comment-16832306 ]
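Tilman Hausherr commented on PDFBOX-4532:
-----------------------------------------

Here's some code for anybody who wants to understand or work on the problem:
{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.contentstream.operator.markedcontent.BeginMarkedContentSequenceWithProperties;
import org.apache.pdfbox.contentstream.operator.markedcontent.EndMarkedContentSequence;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBox4532ExtractText extends PDFTextStripper
{
    public static void main(String[] args) throws IOException
    {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument doc = PDDocument.load(new File("PDFBOX-4532-reduced.pdf")))
        {
            PDFTextStripper stripper = new PDFBox4532ExtractText();
            // the extracted text is discarded; only the debug output matters here
            stripper.getText(doc);
        }
    }

    public PDFBox4532ExtractText() throws IOException
    {
        // register the marked-content operators so the callbacks below get invoked
        addOperator(new BeginMarkedContentSequenceWithProperties());
        addOperator(new EndMarkedContentSequence());
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
    {
        System.out.println("BMC: tag: " + tag.getName() + ", properties: " + properties);
        if (properties != null && properties.containsKey(COSName.ACTUAL_TEXT))
        {
            // text matrix at the start of a span that carries ActualText
            System.out.println("BMC: TextMatrix: " + getTextMatrix());
            System.out.println("BMC: ActualText: " + properties.getString(COSName.ACTUAL_TEXT));
        }
        super.beginMarkedContentSequence(tag, properties);
    }

    @Override
    public void endMarkedContentSequence()
    {
        // text matrix and collected characters at the end of each marked-content span
        System.out.println("EMC: TextMatrix: " + getTextMatrix());
        System.out.println("EMC: CharactersByArticle: " + charactersByArticle);
        super.endMarkedContentSequence();
    }
}
{code}

The output is:
{noformat}
BMC: tag: Span, properties: COSDictionary{COSName{Lang}:COSString{en-US};COSName{MCID}:COSInt{8};}
BMC: tag: Span, properties: COSDictionary{COSName{ActualText}:COSString{.};}
BMC: TextMatrix: [50.0,0.0,0.0,50.0,125.0,700.0]
BMC: ActualText: .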
EMC: TextMatrix: [50.0,0.0,0.0,50.0,137.5,700.0]
EMC: CharactersByArticle: [[0, ]]
EMC: TextMatrix: null
EMC: CharactersByArticle: [[0, , 1]]
{noformat}

> PDFTextStripper replacing the decimal with white space
> ------------------------------------------------------
>
>                 Key: PDFBOX-4532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Akash Gupta
>            Priority: Major
>              Labels: ActualText
>         Attachments: FSUSA00BDD.pdf, PDFBOX-4532-reduced.pdf,
> code_textStripper.PNG, numbers_without_decimal.PNG
>
>
> I'm using the PDFTextStripperByArea to be specific and trying to extract a
> particular area from the document.
> In the output, most of the numbers (all but one) have their decimal point
> replaced by a white space. When I copy and paste the text using Adobe
> Reader/Chrome, the decimal points are preserved.