I responded to Chris directly, but wanted people on this list to be aware that this is supported by PDFBox, and it is even documented :)
http://www.pdfbox.org/userguide/highlighting.html

Ben

> FYI
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> Sent: Thursday, March 09, 2006 6:01 AM
> To: [email protected]
> Subject: [iText-questions] Extracting text location for highlighting in reader
>
> I'm looking into how you can ask the acrobat reader web plugin to
> highlight words so that we can get hit-highlighting of web search
> working for an application.
>
> I've read thru this document:
>
> http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
>
> It seems that you pass an XML document to the reader defining where you
> want highlighting.
>
> However I then need to know the offset on the page of where I want to
> highlight (offset is a count either in characters or words).
>
> So - is iText a good way to extract just the text of a page so that we
> can use it to calculate the offsets?
>
> --
> Chris
>
> At 06:01 AM 3/9/2006, [EMAIL PROTECTED] wrote:
> >http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
> >
> >It seems that you pass an XML document to the reader defining where you
> >want highlighting.
>
> Correct.
>
> >However I then need to know the offset on the page of where I want to
> >highlight (offset is a count either in characters or words).
>
> Correct.
>
> >So - is iText a good way to extract just the text of a page so that we
> >can use it to calculate the offsets?
>
> No.
>
> Look at PdfBox or Multivalent.
>
> Leonard
>
> On Thu, Mar 09, 2006 at 06:50:09AM -0500, Leonard Rosenthol wrote:
> >
> > >So - is iText a good way to extract just the text of a page so that
> > >we can use it to calculate the offsets?
> >
> > No.
> >
> > Look at PdfBox or Multivalent.
>
> Thanks for the pointer.
> Seems like the char offset method isn't too reliable (something that's
> 150 chars inside the text file from PDFBox is 200 chars in according to
> the highlighter in reader).
>
> But - with word based offset (and a lot of guesswork as to what acrobat
> reader thinks is a word boundary) then this looks like it might actually
> fly :)
>
> --
> Chris Searle
> [EMAIL PROTECTED]
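
For anyone trying the word-based approach, here is a minimal sketch of how the word offsets could be computed once the text of a page has been extracted into a plain string (for example with PDFBox). It assumes that words are separated by whitespace and that surrounding punctuation is not part of a word. Both rules are guesses, since the thread does not establish what the reader treats as a word boundary, and the class and method names (WordOffsets, findWordOffsets) are hypothetical. It covers only the offset calculation, not the extraction or the highlight XML.

```java
import java.util.ArrayList;
import java.util.List;

class WordOffsets {

    /**
     * Returns the 0-based word indexes at which `term` occurs in `pageText`.
     * ASSUMPTION: words are whitespace-delimited and leading/trailing
     * punctuation is ignored; the reader's real rule may differ.
     */
    static List<Integer> findWordOffsets(String pageText, String term) {
        List<Integer> hits = new ArrayList<>();
        String trimmed = pageText.trim();
        if (trimmed.isEmpty()) {
            return hits; // no words on this page
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            // Strip punctuation from both ends, e.g. "page." becomes "page"
            String word = words[i].replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (word.equalsIgnoreCase(term)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String pageText = "Extracting text from a PDF page. The text is split into words.";
        System.out.println(findWordOffsets(pageText, "text")); // prints [1, 7]
    }
}
```

If the offsets drift against what the reader highlights, the tokenizing rule in findWordOffsets is the first thing to adjust, since it is the part based on guesswork.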
