The font information is useful, most importantly, because it is necessary to convert from the drawing operators to a standard encoding (i.e. Unicode). Once everything is in Unicode, the font info is less of a concern, PROVIDED that all glyphs can be mapped to Unicode and not to the PUA (Private Use Area).
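As a rough illustration of the PUA proviso (not code from this thread), here is a minimal Python sketch that flags Private Use Area code points in text that has already been extracted by some tool. The function name `find_pua` is hypothetical; the ranges are the three PUA blocks defined by the Unicode standard.

```python
# Private Use Area ranges defined by the Unicode standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def find_pua(text):
    """Return (index, code point) pairs for every PUA character in text.

    `find_pua` is a hypothetical helper, not part of any PDF library.
    """
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in PUA_RANGES)
    ]


if __name__ == "__main__":
    # Example input: ordinary text with one PUA character in the middle.
    sample = "Total: \uf020 42"
    for index, code_point in find_pua(sample):
        print(f"PUA character U+{code_point:04X} at index {index}")
```

If this returns an empty list for both documents, a plain text diff is at least comparing real Unicode; if not, those characters would need font-aware handling before any comparison means much.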
If you end up with PUA issues, as you would with some symbolic glyphs (for example), then you need more complex comparison logic. Pdf2text (assuming that's the xpdf/poppler based one) does address most of the Unicode mapping issues.

Leonard

-----Original Message-----
From: Mike Marchywka [mailto:marchy...@hotmail.com]
Sent: Tuesday, May 26, 2009 1:10 PM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] How can compare the content of two revision

----------------------------------------
> From: lrose...@adobe.com
> To: itext-questions@lists.sourceforge.net
> Date: Tue, 26 May 2009 08:56:22 -0700
> Subject: Re: [iText-questions] How can compare the content of two revision
>
> Comparison of "just text" isn't a good approach, since that certainly doesn't
> compare any other aspects of the page (or the document). Also, your approach
> to "text extraction" isn't really text; it's just the raw operators from the
> PDF. Without taking into account the font & encoding information, you don't
> have text and thus aren't even comparing the right thing there...
>
> And you have a LOT of work ahead of you...

I have skipped details in the thread and am still plowing through my inbox, but if you really WANT to compare text, and that is all I normally need, then pdf2text would work along with diff. Further, see the code I posted, or grep the itext source for "matrix", and you can find ways to extract text with position information. Often the font information is just a large distraction, but sometimes the document authors find it quite important.

>
> Leonard
>
> -----Original Message-----
> From: OscarP [mailto:opasc...@gmail.com]
> Sent: Tuesday, May 26, 2009 11:44 AM
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] How can compare the content of two revision
>
>
> Hi Leonard,
>
>
> OscarP wrote:
>>
>> Probably this is not the right way to get the PDF contents, but I see no
>> other way to do it, and I don't
>> know what else I can try.
>>
>
> I know this. My code extracts only the text of one page and then compares
> this text with another page. If I knew how to do it, I wouldn't ask for
> help here.
>
> But you are right, I haven't read the PDF Reference/ISO 32000-1. I am reading
> it now.
>
> Thanks anyway
>
>
> Leonard Rosenthol-3 wrote:
>>
>> Clearly, you haven't read the PDF Reference/ISO 32000-1 in order to
>> understand PDF and all that it contains if you believe that your presented
>> code is, in any way, a valid way to compare documents...
>>
>> Leonard

_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
Check the site with examples before you ask questions: http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/