Comparing "just text" isn't a good approach, since it ignores every other aspect of the page (or the document). Also, your approach to "text extraction" doesn't really give you text; it's just the raw string operands from the PDF content stream. Without taking the font and encoding information into account, you don't have text and thus aren't even comparing the right thing...
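To make the font and encoding point concrete, here is a minimal sketch in plain Java (no iText calls). It assumes a simple font with single-byte character codes; the class name EncodingDemo, the method decode, and the two lookup tables are hypothetical stand-ins for what a font's /Encoding or /ToUnicode information would supply in a real PDF. The same string operand yields different text depending on which font is active:

```java
import java.util.HashMap;
import java.util.Map;

class EncodingDemo {

    /**
     * Maps each single-byte character code to text through a per-font table.
     * Codes missing from the table become U+FFFD (replacement character).
     */
    static String decode(byte[] codes, Map<Integer, String> codeToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            String text = codeToUnicode.get(code);
            sb.append(text != null ? text : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The same bytes, as they might appear in a string operand of Tj.
        byte[] operand = {0x01, 0x02, 0x03};

        // Hypothetical table for one subset font.
        Map<Integer, String> fontA = new HashMap<>();
        fontA.put(1, "P");
        fontA.put(2, "D");
        fontA.put(3, "F");

        // Hypothetical table for another subset font using the same codes.
        Map<Integer, String> fontB = new HashMap<>();
        fontB.put(1, "X");
        fontB.put(2, "M");
        fontB.put(3, "L");

        System.out.println(decode(operand, fontA)); // prints "PDF"
        System.out.println(decode(operand, fontB)); // prints "XML"
    }
}
```

Comparing the raw operand bytes of two pages would call these two strings equal, although a reader sees different words.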
And you have a LOT of work ahead of you...

Leonard

-----Original Message-----
From: OscarP [mailto:opasc...@gmail.com]
Sent: Tuesday, May 26, 2009 11:44 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] How can compare the content of two revision


Hi Leonard,


OscarP wrote:
>
> Probably this is not the right way to get the PDF contents, but I see no
> other way to do it, and I don't know what else I can try.
>

I know this. My code extracts only the text of one page and then compares this text with another page. If I knew how to do it, I wouldn't ask for help here. But you are right, I haven't read the PDF Reference/ISO 32000-1. I am reading it now.

Thanks anyway


Leonard Rosenthol-3 wrote:
>
> Clearly, you haven't read the PDF Reference/ISO 32000-1 in order to
> understand PDF and all that it contains if you believe that your presented
> code is, in any way, a valid way to compare documents...
>
> Leonard
>
> -----Original Message-----
> From: OscarP [mailto:opasc...@gmail.com]
> Sent: Tuesday, May 26, 2009 10:30 AM
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] How can compare the content of two revision
>
>
> Hi,
>
> Ok Michael, I was able to get the PDF contents, but my method doesn't work
> for all PDF files. For instance, I can't get it to work with PDF files
> generated with OpenOffice.
>
> import java.io.*;
> import java.util.*;
>
> import com.lowagie.text.*;
> import com.lowagie.text.pdf.*;
>
> public class Example {
>     public static void main(String[] args) {
>         comprobar("d:\\pruebas\\PDFs\\textoF2I.pdf");
>     }
>
>     public static void comprobar(String fichero) {
>         System.out.println("/////////////////////////////////////");
>         System.out.println(fichero);
>         System.out.println("/////////////////////////////////////");
>         try {
>             PdfReader reader1 = new PdfReader(fichero);
>             System.out.println(obtenerPaginaPDF(reader1,1));
>         }catch(Exception e){
>             e.printStackTrace();
>         }
>     }
>
>     public static String obtenerPaginaPDF (PdfReader reader,int i){
>         try{
>             PdfDictionary page = reader.getPageN(i);
>             byte[] streamBytes = getStreamBytes(page);
>
>             PRTokeniser tokenizer = new PRTokeniser(streamBytes);
>             StringBuffer sb = new StringBuffer();
>             boolean arrayAbierto = false;
>             while (tokenizer.nextToken()) {
>                 if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
>                     if (tokenizer.getStringValue().equals(" ") && !arrayAbierto)
>                         sb.append("\n");
>                     else
>                         sb.append(tokenizer.getStringValue());
>                 }
>                 else if (tokenizer.getTokenType() == PRTokeniser.TK_START_ARRAY) {
>                     arrayAbierto=true;
>                 }
>                 else if (tokenizer.getTokenType() == PRTokeniser.TK_END_ARRAY) {
>                     arrayAbierto=false;
>                     sb.append("\n");
>                 }
>             }
>             return sb.toString();
>         } catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         return null;
>
>     }
>
>     private static byte[] getStreamBytes(PdfDictionary page) throws IOException{
>         PdfObject resources = page.get(PdfName.RESOURCES);
>
>         byte[] streamBytes=null;
>         if (resources instanceof PdfDictionary){
>             try{
>                 PdfDictionary object = (PdfDictionary) ((PdfDictionary)resources).get(PdfName.XOBJECT);
>                 if (object!=null){
>                     Set set = object.getKeys();
>                     Iterator it = set.iterator();
>                     while (it.hasNext()){
>                         PdfName s = (PdfName) it.next();
>                         if (object.get(s) instanceof PRIndirectReference){
>                             PRIndirectReference objectReference = (PRIndirectReference) object.get(s);
>                             PRStream stream = (PRStream) PdfReader.getPdfObject(objectReference);
>                             streamBytes = PdfReader.getStreamBytes(stream);
>                         }
>                     }
>                 }
>             }catch(Exception e){
>                 e.printStackTrace();
>             }
>         }
>         else if (resources instanceof PRIndirectReference){
>             try{
>                 PdfDictionary object = (PdfDictionary)PdfReader.getPdfObject(resources);
>                 if (object!=null){
>                     Set set = object.getKeys();
>                     Iterator it = set.iterator();
>                     while (it.hasNext()){
>                         PdfName s = (PdfName) it.next();
>                         if (object.get(s) instanceof PRIndirectReference){
>                             PRIndirectReference objectReference = (PRIndirectReference) object.get(s);
>                             PRStream stream = (PRStream) PdfReader.getPdfObject(objectReference);
>                             streamBytes = PdfReader.getStreamBytes(stream);
>                         }
>                     }
>                 }
>             }catch(Exception e){
>             }
>         }
>         if (streamBytes==null){
>             PdfObject ob = page.get(PdfName.CONTENTS);
>             if (ob instanceof PRIndirectReference){
>                 PRIndirectReference contents = (PRIndirectReference) page.get(PdfName.CONTENTS);
>                 PRStream streamContents = (PRStream) PdfReader.getPdfObject(contents);
>                 streamBytes = PdfReader.getStreamBytes(streamContents);
>             }
>             else if (ob instanceof PdfArray){
>                 for (int j=0;j<((PdfArray)ob).size();j++){
>                     PRIndirectReference ir = (PRIndirectReference)((PdfArray)ob).getPdfObject(j);
>                     PRStream streamContents = (PRStream) PdfReader.getPdfObject(ir);
>                     streamBytes = PdfReader.getStreamBytes(streamContents);
>                 }
>             }
>         }
>         return streamBytes;
>     }
> }
>
> Probably this is not the right way to get the PDF contents, but I see no
> other way to do it, and I don't know what else I can try.
>
> I have executed this code with these files:
> - Generated with Acrobat Professional
> http://www.nabble.com/file/p23723941/firmado2vecesOk.pdf
> firmado2vecesOk.pdf
> - Generated with Ghostscript
> http://www.nabble.com/file/p23723941/2274_2007_H_PROVISIONAL.pdf
> 2274_2007_H_PROVISIONAL.pdf
> - Generated with MSWord
> http://www.nabble.com/file/p23723941/Security%2BArchitecture.pdf
> Security+Architecture.pdf
> - Generated with OpenOffice
> http://www.nabble.com/file/p23723941/Prueba-para-Oscar.pdf
> Prueba-para-Oscar.pdf
>
> All the examples work "fine" (I haven't tested them with embedded images),
> except the OpenOffice one.
>
> Could you please show me an example of how to do this? Could you at least
> tell me what is going wrong?
>
>
> Thank you very much in advance.
>
>
>
> mkl wrote:
>>
>> Oscar,
>>
>>
>> OscarP wrote:
>>>
>>> OK,
>>> I have spent several days working on this, but I cannot find out
>>> anything. How can I get those differences? I've analysed the binary of
>>> this document
>>> http://www.nabble.com/file/p23704652/textoF2IMod.pdf textoF2IMod.pdf ,
>>> but the object with the difference (70 0) returns null with iText
>>> (reader.refObj[70]).
>>>
>>
>> 70 0 contains a cross-reference stream. iText hides away cross-reference
>> streams it comes across when collecting cross-reference information by
>> explicitly marking the matching entry in memory as a freed object. ( "if
>> (thisStream < xref.length) xref[thisStream] = -1;" in
>> PdfReader.readXRefStream)
>>
>> (Actually 70 0 is the cross-reference stream holding only the information
>> about object 70 0...)
>>
>> The rationale for this might be some self protection; usually you never
>> tamper with any former cross-reference tables or streams. When trying to
>> inspect a PDF in detail this is a bit uncomfortable, though.
>>
>>
>> OscarP wrote:
>>>
>>> To sum it all up, I need to know whether there are differences between
>>> one signature and the other. I'd be very grateful if you could tell me
>>> the way to get that result with iText.
>>>
>>
>> Whether there are differences between the signatures? Do you refer to the
>> signature containers or the whole signature dictionaries? Either way,
>> they are directly available from the AcroFields, aren't they?
>>
>> Regards, Michael.
>>
>
> --
> View this message in context:
> http://www.nabble.com/How-can-compare-the-content-of-two-revision-tp23649348p23723941.html
> Sent from the iText - General mailing list archive at Nabble.com.
>
>
> ------------------------------------------------------------------------------
> Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT
> is a gathering of tech-side developers & brand creativity professionals.
> Meet the minds behind Google Creative Lab, Visual Complexity, Processing, &
> iPhoneDevCamp as they present alongside digital heavyweights like Barbarian
> Group, R/GA, & Big Spaceship. http://www.creativitycat.com
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
> Check the site with examples before you ask questions:
> http://www.1t3xt.info/examples/
> You can also search the keywords list:
> http://1t3xt.info/tutorials/keywords/
>

--
View this message in context: http://www.nabble.com/How-can-compare-the-content-of-two-revision-tp23649348p23725846.html
Sent from the iText - General mailing list archive at Nabble.com.
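A note on the quoted getStreamBytes method: when the page's /Contents entry is an array, its loop assigns streamBytes anew for every entry, so only the last stream is kept. ISO 32000-1 treats such an array as one stream formed by concatenating the entries in order, with the breaks falling between tokens. Below is a minimal sketch of that concatenation in plain Java with no iText calls. The class name ContentStreamJoin and the method joinContentStreams are hypothetical, and the inputs are assumed to be already decoded bytes (for instance what PdfReader.getStreamBytes returns in the quoted code). This only addresses the truncation; it does not turn string operands into text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class ContentStreamJoin {

    /**
     * Concatenates decoded content streams in order. A newline is written
     * after each part so that the last token of one stream cannot run into
     * the first token of the next.
     */
    static byte[] joinContentStreams(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : decodedStreams) {
            out.write(part, 0, part.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two hypothetical pieces of one page's content, split between tokens.
        List<byte[]> parts = Arrays.asList(
                "BT /F1 12 Tf".getBytes(StandardCharsets.US_ASCII),
                "(Hi) Tj ET".getBytes(StandardCharsets.US_ASCII));

        byte[] joined = joinContentStreams(parts);
        System.out.print(new String(joined, StandardCharsets.US_ASCII));
    }
}
```

In the quoted code, the idea would be to collect each stream's bytes inside the PdfArray loop and tokenize the joined result, instead of overwriting the variable on each pass.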