Hi,

Some follow-up on this, in case anyone is interested.

I found that the reason for the difference is that in some PDFs the spaces are included as TextPositions (or whatever they are called), but in others only the non-blank characters are explicitly placed on the page. It appears that when doing XML highlighting Acrobat just counts the characters it has been given, whereas the PDFBox text extraction must be cleverer and puts the spaces back in. Thus the index from searching the extracted text is out of sync with the character index within Acrobat.

I also had problems, within the environment I was implementing this in, with making both the PDF and the XML file available. That, and the fact that you can only highlight in one colour, made me look into using annotations instead.

I have now implemented an alternative approach that involves:

1. Creating a class that extends PDFTextStripper (as per the example org/apache/pdfbox/examples/util/PrintTextLocations.java).
2. Within this, creating a string containing all the text, including:
   - Compare Y position + width against the Y position of the next TextPosition to add spaces back in.
   - Compare X and, when a new line starts, remove trailing -'s if present.
   - Keep a record of TextPosition values against indexes into this string as I build it up.
   - Also keep a record of page values for each TextPosition.
3. Using regular expressions on the full string to locate the indexes of the start and finish points.
4. Finding the closest matching TextPositions to the start and end of the regular expression matches.
5. Using the X, Y, width and height of these to create a List of quad points for each page, including splitting across lines and across pages.
6. Adding Text Markup Annotations to pages based on these quad points.

The result is generally okay. If the text that is represented by a TextPosition extends beyond either one or both sides of the search string, then I end up highlighting more than I need.
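
The index bookkeeping in the steps above (keeping a record of which TextPosition each character of the string came from, then mapping regular expression matches back to the closest TextPositions) can be sketched roughly as follows. This is only an illustration, not the actual implementation: MatchLocator, IndexedText and the integer "glyph indexes" are hypothetical names standing in for a list of collected TextPositions, and the PDFBox side (the PDFTextStripper subclass and the decision about where spaces go) is assumed to have run already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: maps regex matches in the rebuilt text back to glyph indexes. */
public class MatchLocator {

    /** Rebuilt text plus, for each character, the index of the glyph it came from. */
    public static final class IndexedText {
        final StringBuilder text = new StringBuilder();
        // glyphAt.get(i) is the glyph index for character i, or -1 for a re-inserted space.
        final List<Integer> glyphAt = new ArrayList<>();

        /** Appends the characters of one glyph (one glyph may hold several characters). */
        public void appendGlyph(String chars, int glyphIndex) {
            for (int i = 0; i < chars.length(); i++) {
                text.append(chars.charAt(i));
                glyphAt.add(glyphIndex);
            }
        }

        /** Appends a space that was not explicitly placed on the page. */
        public void appendSyntheticSpace() {
            text.append(' ');
            glyphAt.add(-1);
        }
    }

    /** Walks from charIndex in the direction of step until a real glyph is found. */
    static int nearestGlyph(IndexedText t, int charIndex, int step) {
        for (int i = charIndex; i >= 0 && i < t.glyphAt.size(); i += step) {
            int glyph = t.glyphAt.get(i);
            if (glyph >= 0) {
                return glyph;
            }
        }
        return -1;
    }

    /** Returns {firstGlyph, lastGlyph} for every non-empty match of regex. */
    public static List<int[]> locate(IndexedText t, String regex) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(t.text);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // nothing to highlight for an empty match
            }
            int first = nearestGlyph(t, m.start(), 1);
            int last = nearestGlyph(t, m.end() - 1, -1);
            if (first >= 0 && last >= first) {
                spans.add(new int[] {first, last});
            }
        }
        return spans;
    }
}
```

If "Mary" is stored as a single glyph 0, followed by a synthetic space and the glyphs "h", "a", "d" as 1 to 3, then searching for "had" gives the span {1, 3}, while searching for "ry h" gives {0, 1}. The second span covers the whole of "Mary", which is the over-highlighting effect mentioned above.
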
I could fix the over-highlighting using calculated widths etc., but as it is I am only interested in highlighting parts of the PDF to indicate that someone may want to pay special attention when examining it, so exact matching is not really necessary.

The other thing is that you obviously need PDFs that allow you to add annotations, but most of the academic ones I am having to deal with appear to allow this by default.

Don't know if that helps anyone.

Cheers,
Paul

-----Original Message-----
From: Paul Edgington [mailto:edging...@ccdc.cam.ac.uk]
Sent: 17 June 2009 09:34
To: pdfbox-users@incubator.apache.org
Subject: Odd highlighting caused by acrobat ignoring white spaces?

Hi,

I have been trying to use PDFBox to generate the highlighting XML file for some scientific papers, but on some of the papers the highlighting is in the wrong place. I think I have narrowed it down to the point where it appears that for some PDFs Acrobat ignores the white space when counting the character offset and for others it does not.

So, for example, if my highlighting was:

<loc pg=0 pos=33 len=1>

and the line was:

Mary had a little lamb it's fleece was white as snow.

then in some PDFs the white spaces are counted and fleece is highlighted:

Mary had a little lamb it's flee*ce was white as snow.

In others it appears the white spaces are not counted and white is highlighted:

Mary had a little lamb it's fleece was wh*ite as snow.

In both cases, looking at the extracted text, they both appear to contain normal spaces.

An example PDF that shows the latter effect can be found here:

http://pubs.acs.org/doi/abs/10.1021/ol900393x

where the first line is "Convenient Synthesis of Tetra- and". Using pos=21 should result in 'of' being highlighted, but in fact 'Tetra-' is highlighted.

Has anyone else come across this?
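
The mismatch in the title example above can be reproduced with a small sketch that converts between an offset that counts white space and one that does not. It assumes 0-based offsets, which is consistent with the title example but is still an assumption about Acrobat's behaviour, and OffsetConverter and its method names are hypothetical.

```java
/** Sketch only: converts between offsets that count white space and offsets that do not. */
public class OffsetConverter {

    /** Number of non-whitespace characters that come before index in text. */
    public static int withoutSpaces(String text, int index) {
        int count = 0;
        for (int i = 0; i < index && i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /** Index in text of the n-th (0-based) non-whitespace character, or -1 if there is none. */
    public static int withSpaces(String text, int n) {
        int seen = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                if (seen == n) {
                    return i;
                }
                seen++;
            }
        }
        return -1;
    }
}
```

For the line "Convenient Synthesis of Tetra- and", offset 21 counting spaces is the 'o' of "of", and withoutSpaces(line, 21) gives 19. Going the other way, withSpaces(line, 21) gives 24, the 'T' of "Tetra-", which matches what was highlighted.
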

I can fix it in part by re-implementing the code that generates the file so that it subtracts the number of spaces from the regex match positions, but that is of no use unless I can determine in advance which way Acrobat is going to treat the spaces, which I am struggling to do.

Thanks,
Paul