I responded to Chris directly but wanted people on this list to be 
aware that this is supported by PDFBox and it is even documented :)

http://www.pdfbox.org/userguide/highlighting.html

Ben



> FYI
> 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Thursday, March 09, 2006 6:01 AM
> To: [email protected]
> Subject: [iText-questions] Extracting text location for highlighting in
> reader
> 
> 
> I'm looking into how you can ask the acrobat reader web plugin to
> highlight words so that we can get hit-highlighting of web search
> working for an application.
> 
> I've read thru this document: 
> 
> 
> http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
> 
> It seems that you pass an XML document to the reader defining where you
> want highlighting.
> 
> However I then need to know the offset on the page of where I want to
> highlight (offset is a count either in characters or words).
> 
> So - is iText a good way to extract just the text of a page so that we
> can use it to calculate the offsets?
> 
> -- 
> Chris
> 
> At 06:01 AM 3/9/2006, [EMAIL PROTECTED] wrote:
> 
> >http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
> >
> >It seems that you pass an XML document to the reader defining where you
> >want highlighting.
> 
>          Correct.
> 
> 
> >However I then need to know the offset on the page of where I want to
> >highlight (offset is a count either in characters or words).
> 
>          Correct.
> 
> 
> >So - is iText a good way to extract just the text of a page so that we
> >can use it to calculate the offsets?
> 
>          No.
> 
>          Look at PdfBox or Multivalent.
> 
> 
> Leonard
> 
> 
> 
> On Thu, Mar 09, 2006 at 06:50:09AM -0500, Leonard Rosenthol wrote:
> > 
> > >So - is iText a good way to extract just the text of a page so that
> > >we can use it to calculate the offsets?
> > 
> >         No.
> > 
> >         Look at PdfBox or Multivalent.
> 
> Thanks for the pointer. Seems like the char offset method isn't too
> reliable (something that's 150 chars into the text file from PDFBox is
> 200 chars in according to the highlighter in reader).
> 
> But - with word based offset (and a lot of guesswork as to what acrobat
> reader thinks is a word boundary) then this looks like it might actually
> fly :)
> 
> -- 
> Chris Searle
> [EMAIL PROTECTED]
> 
> 
> 
> 
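
For the word-based offsets Chris mentions, here is a minimal sketch in plain
Java of counting words in page text that has already been extracted. It
assumes words are separated by whitespace and that leading or trailing
punctuation is not part of a word; that rule is a guess, and Reader's real
word boundaries may well differ. The names WordOffsets, words and find are
made up for illustration and are not part of PDFBox or iText.

```java
import java.util.ArrayList;
import java.util.List;

public class WordOffsets {

    /** Splits page text on runs of whitespace (an assumed word-boundary rule). */
    public static String[] words(String pageText) {
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Returns the zero-based word indexes where the term occurs, ignoring case. */
    public static List<Integer> find(String pageText, String term) {
        List<Integer> hits = new ArrayList<>();
        String[] w = words(pageText);
        for (int i = 0; i < w.length; i++) {
            // Strip leading/trailing punctuation so "reader." matches "reader".
            String bare = w[i].replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (bare.equalsIgnoreCase(term)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "Highlight words in the reader. The reader counts words.";
        System.out.println(find(page, "reader")); // prints [4, 6]
        System.out.println(find(page, "words"));  // prints [1, 8]
    }
}
```

The indexes would still need checking against what Reader actually
highlights, since the two sides may not agree on where a word starts.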




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers