Hi,

I have been trying to use PDFBox to generate the highlighting XML file for
some scientific papers, but on some of the papers the highlighting is in the
wrong place. I think I have narrowed it down: it appears that for some PDFs
Acrobat ignores whitespace when counting character offsets, and for others
it does not. So, for example, if my highlighting was:

<loc pg=0 pos=33 len=1>

And the line was:

Mary had a little lamb it's fleece was white as snow.

Then in some PDFs the whitespace is counted and 'fleece' is highlighted:

Mary had a little lamb it's flee*ce was white as snow.

In others it appears the whitespace is not counted and 'white' is
highlighted:

Mary had a little lamb it's fleece was wh*ite as snow.

In both cases the extracted text appears to contain normal spaces.

An example PDF that shows the latter effect can be found here:

http://pubs.acs.org/doi/abs/10.1021/ol900393x

Where the first line is "Convenient Synthesis of Tetra- and".

Using pos=21 should result in 'of' being highlighted, but in fact 'Tetra-' is
highlighted.
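
To make the two behaviours concrete, here is a minimal Java sketch. It is a
hypothetical helper of my own, not PDFBox code, and it assumes that offsets
are zero-based and that "white space" means anything Character.isWhitespace
accepts. It only reports which word a given pos lands in under each counting
scheme.

```java
// Hypothetical helper, not part of PDFBox: shows which word a highlight
// offset lands in under the two counting schemes described above.
class OffsetDemo {

    // Returns the whitespace-delimited word containing the character at
    // zero-based offset pos. When countSpaces is false, whitespace
    // characters are skipped while counting. Returns "" if pos lands on
    // whitespace or is past the end of the line.
    static String wordAt(String line, int pos, boolean countSpaces) {
        int counted = 0;
        for (int i = 0; i < line.length(); i++) {
            boolean ws = Character.isWhitespace(line.charAt(i));
            if (ws && !countSpaces) {
                continue; // this character does not advance the offset
            }
            if (counted == pos) {
                if (ws) {
                    return "";
                }
                // Expand left and right to the surrounding word boundaries.
                int start = i;
                while (start > 0 && !Character.isWhitespace(line.charAt(start - 1))) {
                    start--;
                }
                int end = i;
                while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
                    end++;
                }
                return line.substring(start, end);
            }
            counted++;
        }
        return "";
    }

    public static void main(String[] args) {
        String line = "Convenient Synthesis of Tetra- and";
        System.out.println(wordAt(line, 21, true));   // spaces counted: of
        System.out.println(wordAt(line, 21, false));  // spaces ignored: Tetra-
    }
}
```

With spaces counted, pos=21 falls in 'of'; with spaces ignored it falls in
'Tetra-', which matches what Acrobat does with this PDF.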

Has anyone else come across this? I can fix it in part by re-implementing
the code that generates the file so that spaces are excluded from the regex
match positions, but that is of no use unless I can determine in advance
which way Acrobat is going to treat the spaces, which I am struggling to do.
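
For reference, the partial fix I have in mind looks roughly like the sketch
below. The class and method names are my own (hypothetical, nothing from
PDFBox), and it assumes zero-based offsets into the extracted page text. It
converts a match position that counts every character into one that counts
only non-whitespace characters.

```java
// Hypothetical workaround, not part of PDFBox: re-bases a match offset so
// that whitespace characters are not counted.
class OffsetConverter {

    // Converts a zero-based offset counted over all characters of text into
    // the equivalent offset counted over non-whitespace characters only,
    // i.e. the number of non-whitespace characters that precede pos.
    static int toNonWhitespaceOffset(String text, int pos) {
        int limit = Math.min(pos, text.length());
        int count = 0;
        for (int i = 0; i < limit; i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "Convenient Synthesis of Tetra- and";
        // 'of' starts at 21 when spaces are counted, 19 when they are not.
        System.out.println(toNonWhitespaceOffset(line, 21));
    }
}
```

This only helps for the PDFs where Acrobat skips the spaces; it does nothing
to tell me which kind of PDF I am dealing with.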

Thanks,

Paul





