Hi,

A follow-up on this, in case anyone is interested.

I found that the difference arises because in some PDFs the spaces are
included as TextPosition objects, but in others only the non-blank
characters are explicitly placed on the page. It appears that when doing XML
highlighting Acrobat just counts the characters it has been given, whereas
the PDFBox text extraction is cleverer and puts the spaces back in. Thus the
index from searching the extracted text is out of sync with the character
index within Acrobat.
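
To make the mismatch concrete, here is a minimal sketch (plain Java, no
PDFBox needed) that converts an offset in the extracted text into the offset
you would get if only the non-blank characters were counted. The class and
method names are made up for illustration, and it assumes that "not counting
spaces" means skipping every whitespace character, which I have not verified
against Acrobat itself.

```java
public class OffsetConverter {
    /**
     * Counts the non-whitespace characters that precede {@code index} in
     * {@code text}, i.e. the offset when spaces are not counted (assumption).
     */
    static int toSpacelessOffset(String text, int index) {
        int count = 0;
        for (int i = 0; i < index && i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "Convenient Synthesis of Tetra- and";
        int extracted = line.indexOf("of");   // offset in the extracted text
        int spaceless = toSpacelessOffset(line, extracted);
        System.out.println(extracted + " -> " + spaceless); // prints "21 -> 19"
    }
}
```

This only helps for PDFs where the spaces are not placed on the page; for
the others the extracted-text offset can be used unchanged.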

I also had problems, in the environment I was implementing this for, with
making both the PDF and the XML file available. That, and the fact that you
can only highlight in one colour, made me look into using annotations
instead.

I have now implemented an alternative approach that involves:

- Create a class that extends PDFTextStripper (as per the example
  org/apache/pdfbox/examples/util/PrintTextLocations.java).
- Within this class, build up a string containing all the text:
  - Compare X position + width against the X position of the next
    TextPosition to add spaces back in.
  - Compare Y positions and, on a new line, remove a trailing '-' if
    present.
- Keep a record of the TextPosition values against indexes into this string
  as it is built up.
- Also keep a record of the page number for each TextPosition.
- Use regular expressions on the full string to locate the start and end
  indexes of each match.
- Find the closest matching TextPositions to the start and end of each
  regular expression match.
- Use the X, Y, width and height of these to create a List of quad points
  for each page, splitting across lines and across pages as needed.
- Add Text Markup Annotations to the pages based on these quad points.
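
The steps above can be sketched roughly as follows. This is not my actual
code and it does not use PDFBox: Glyph is a hypothetical stand-in for
TextPosition, the 30% gap threshold is an arbitrary assumption, and the
result is one rectangle per line of each match, from which the quad points
for the annotation would still have to be built. It uses the X gap to detect
spaces within a line and a change in Y (or page) to detect a new line, and it
assumes a line break separates words unless the line ended in a hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HighlightSketch {
    /** Hypothetical stand-in for a PDFBox TextPosition: text plus its box. */
    static class Glyph {
        final String str; final int page; final float x, y, w, h;
        Glyph(String str, int page, float x, float y, float w, float h) {
            this.str = str; this.page = page;
            this.x = x; this.y = y; this.w = w; this.h = h;
        }
    }

    // Assumed threshold: a gap wider than 30% of the previous glyph is a space.
    static final float SPACE_GAP = 0.3f;

    final StringBuilder text = new StringBuilder();
    // owner.get(i) is the glyph behind text.charAt(i), or null for a
    // space that was put back in.
    final List<Glyph> owner = new ArrayList<>();
    private Glyph prev;

    static boolean sameLine(Glyph a, Glyph b) {
        return a.page == b.page && Math.abs(a.y - b.y) <= a.h / 2;
    }

    private void appendSpace() {
        text.append(' ');
        owner.add(null);
    }

    /** Appends one glyph, restoring spaces and dropping a line-end hyphen. */
    void add(Glyph g) {
        if (prev != null && text.length() > 0) {
            char lastChar = text.charAt(text.length() - 1);
            if (!sameLine(prev, g)) {
                if (lastChar == '-') {
                    text.setLength(text.length() - 1);
                    owner.remove(owner.size() - 1);
                } else if (lastChar != ' ') {
                    appendSpace(); // assumption: a line break separates words
                }
            } else if (lastChar != ' '
                    && g.x - (prev.x + prev.w) > SPACE_GAP * prev.w) {
                appendSpace();
            }
        }
        for (int i = 0; i < g.str.length(); i++) {
            text.append(g.str.charAt(i));
            owner.add(g);
        }
        prev = g;
    }

    /** One {page, x, y, width, height} box per line of each regex match. */
    List<float[]> locate(String regex) {
        List<float[]> boxes = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            Glyph first = null, last = null;
            for (int i = m.start(); i < m.end(); i++) {
                Glyph g = owner.get(i);
                if (g == null) continue;
                if (first != null && !sameLine(last, g)) {
                    boxes.add(box(first, last)); // match continues on a new line
                    first = null;
                }
                if (first == null) first = g;
                last = g;
            }
            if (first != null) boxes.add(box(first, last));
        }
        return boxes;
    }

    static float[] box(Glyph a, Glyph b) {
        return new float[] { a.page, a.x, a.y, b.x + b.w - a.x, a.h };
    }
}
```

In a real PDFTextStripper subclass, add() would be fed from the
TextPositions the stripper hands you, and each box would be turned into quad
points on the matching page.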

The result is generally okay. If the text represented by a TextPosition
extends beyond the search string on one or both sides, then I end up
highlighting more than I need. I could fix this using calculated widths etc.,
but as it is I am only interested in highlighting parts of the PDF to
indicate that someone may want to pay special attention when examining it,
so exact matching is not really necessary. The other thing is that you
obviously need PDFs that allow you to add annotations, but most of the
academic ones I have to deal with appear to allow this by default.
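
For what it is worth, the "calculated widths" fix could look something like
the sketch below. I have not implemented this; the class and method are
hypothetical, and it assumes every character in a run is equally wide, which
is only an approximation for proportional fonts. If per-character widths are
available (TextPosition has getIndividualWidths(), as far as I know), those
would be the better input.

```java
public class ClipSketch {
    /**
     * Returns {x, width} covering characters [from, to) of a run that starts
     * at x and is runWidth wide, assuming equal-width characters.
     */
    static float[] clip(float x, float runWidth, int runLength, int from, int to) {
        float perChar = runWidth / runLength;
        return new float[] { x + from * perChar, (to - from) * perChar };
    }

    public static void main(String[] args) {
        String run = "little lamb";            // text of one run
        int from = run.indexOf("lamb");        // 7
        float[] box = clip(100f, 110f, run.length(), from, from + 4);
        System.out.println(box[0] + " " + box[1]); // prints "170.0 40.0"
    }
}
```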

Don't know if that helps anyone.

Cheers,

Paul



-----Original Message-----
From: Paul Edgington [mailto:edging...@ccdc.cam.ac.uk] 
Sent: 17 June 2009 09:34
To: pdfbox-users@incubator.apache.org
Subject: Odd highlighting caused by acrobat ignoring white spaces?

Hi,

I have been trying to use PDFBox to generate the highlighting XML file for
some scientific papers, but on some of the papers the highlighting is in the
wrong place. I think I have narrowed it down to the point where it appears
that for some PDFs Acrobat ignores the white space when counting character
offsets and for others it does not. So, for example, if my highlighting was:

<loc pg=0 pos=33 len=1>

And the line was:

Mary had a little lamb it's fleece was white as snow.

Then in some PDFs the white spaces are counted and fleece is highlighted:

Mary had a little lamb it's flee*ce was white as snow.

In others it appears the white spaces are not counted and white is
highlighted:

Mary had a little lamb it's fleece was wh*ite as snow.

In both cases the extracted text appears to contain normal spaces.

An example PDF that shows the latter effect can be found here:

http://pubs.acs.org/doi/abs/10.1021/ol900393x

Where the first line is "Convenient Synthesis of Tetra- and".

Using pos=21 should result in 'of' being highlighted, but in fact 'Tetra-' is
highlighted.

Has anyone else come across this? I can fix it in part by re-implementing
the code that generates the file so that it subtracts the count of spaces
from the regex match positions, but that is of no use unless I can determine
in advance which way Acrobat is going to treat the spaces, which I am
struggling to do.

Thanks,

Paul


