Hi,

I have been trying to use PDFBox to generate the highlighting XML file for some scientific papers, but on some of the papers the highlighting ends up in the wrong place. I think I have narrowed it down: for some PDFs Acrobat appears to ignore white space when counting character offsets, and for others it does not. For example, suppose my highlighting entry is:
<loc pg=0 pos=33 len=1>

and the line is:

Mary had a little lamb it's fleece was white as snow.

Then in some PDFs the white spaces are counted and "fleece" is highlighted:

Mary had a little lamb it's flee*ce was white as snow.

In others it appears the white spaces are not counted and "white" is highlighted:

Mary had a little lamb it's fleece was wh*ite as snow.

In both cases the extracted text appears to contain normal spaces.

An example PDF that shows the latter effect can be found here: http://pubs.acs.org/doi/abs/10.1021/ol900393x

Its first line is "Convenient Synthesis of Tetra- and". Using pos=21 should result in 'of' being highlighted, but in fact 'Tetra-' is highlighted.

Has anyone else come across this? I can fix it in part by re-implementing the code that generates the file so that it leaves the spaces out of the count when computing the regex match positions, but that is of no use unless I can determine in advance which way Acrobat is going to treat the spaces, which I am struggling to do.

Thanks,
Paul
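
For reference, here is a minimal sketch in plain Java of the offset conversion described above: turning a position that counts spaces into one that does not. The class and method names (`OffsetSketch`, `toNonWhitespaceOffset`) are hypothetical and not part of PDFBox. The sketch assumes offsets are 0-based and that the characters Acrobat skips are exactly those `Character.isWhitespace` reports; neither assumption has been confirmed. It does not address the open question of detecting which convention a given PDF needs.

```java
class OffsetSketch {
    // Hypothetical helper: counts the non-whitespace characters that precede
    // `offset` in `text`, i.e. converts a whitespace-inclusive offset into a
    // whitespace-free one. Assumes 0-based offsets.
    static int toNonWhitespaceOffset(String text, int offset) {
        int count = 0;
        for (int i = 0; i < offset && i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "Convenient Synthesis of Tetra- and";
        int withSpaces = line.indexOf("of");                   // offset counting spaces
        int withoutSpaces = toNonWhitespaceOffset(line, withSpaces);
        System.out.println(withSpaces + " -> " + withoutSpaces); // prints "21 -> 19"
    }
}
```

With this conversion, the "Convenient Synthesis" line would need pos=19 rather than pos=21 to land on 'of' in a PDF where spaces are not counted, which is consistent with pos=21 landing on 'Tetra-' there.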