[ https://issues.apache.org/jira/browse/PDFBOX-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vitalie Bureanu updated PDFBOX-1542:
------------------------------------

    Attachment: Parser.java

Our text extractor (with coordinates for each symbol).

> Whitespaces between words are not created
> -----------------------------------------
>
>                 Key: PDFBOX-1542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1542
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 1.7.1
>            Reporter: Vitalie Bureanu
>            Priority: Minor
>         Attachments: Parser.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hello, I extract text from PDF files with PDFBox. I noticed that text
> extraction from some PDF files is not as good as expected. I have a
> series of PDF invoices from which I try to extract the text with
> coordinates, and the result is pretty good, but I noticed a very strange
> thing: the words are extracted without whitespace between them. Example:
> if I try to extract "Unit Price", the result is "UnitPrice".
> But if I open the invoice in Adobe Reader and copy/paste into Notepad,
> I get "Unit Price" with the whitespace!
> I think the whitespace is not present in the original PDF document, but
> Adobe Reader somehow inserts whitespace between words when it shows the
> content of the PDF.
>
> Guys, can you please suggest how I can get the strings with spaces after
> parsing?
> See an example invoice here: http://www.cloudforpeople.com/Invoice1.pdf
> PS: I want to try version 1.8.0 of PDFBox. How can I download it?
> Many thanks,
> Vitalie
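
When a PDF positions words without storing a space character between them, one common way to recover the spaces (as in "Unit Price") is to compare the horizontal gap between neighbouring glyphs with a fraction of the average glyph width. The following is only a minimal sketch of that heuristic in plain Java. It does not use the PDFBox API and is not the attached Parser.java. The names GapSpaceInserter, Glyph, joinWithSpaces and appendWord, and the 0.5 tolerance, are hypothetical and chosen for illustration. It assumes the glyphs belong to one line and are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: inserts spaces based on glyph gaps.
final class GapSpaceInserter {

    /** One extracted glyph: its text, left x coordinate and advance width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of a single line (assumed sorted left to right),
     * inserting a space wherever the horizontal gap between two neighbours
     * exceeds tolerance * average glyph width. The tolerance is an assumed
     * tuning value, not a PDFBox default.
     */
    static String joinWithSpaces(List<Glyph> line, float tolerance) {
        if (line.isEmpty()) {
            return "";
        }
        float totalWidth = 0f;
        for (Glyph g : line) {
            totalWidth += g.width;
        }
        float threshold = tolerance * (totalWidth / line.size());

        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Gap between the right edge of the previous glyph and this one.
                float gap = g.x - (prev.x + prev.width);
                if (gap > threshold) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    /**
     * Test helper: appends the characters of word to line as fixed-width
     * glyphs starting at startX, and returns the x position after the word.
     */
    static float appendWord(List<Glyph> line, String word, float startX, float glyphWidth) {
        float x = startX;
        for (int i = 0; i < word.length(); i++) {
            line.add(new Glyph(String.valueOf(word.charAt(i)), x, glyphWidth));
            x += glyphWidth;
        }
        return x;
    }
}
```

For example, if "Unit" is laid out at x = 100 with glyphs 5 units wide and "Price" starts 3 units after it, joinWithSpaces(line, 0.5f) returns "Unit Price". With no extra gap between the two words it returns "UnitPrice". A real extractor would also have to group glyphs into lines and cope with varying font sizes, which this sketch leaves out.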