Re: [iText-questions] How can compare the content of two revision

Leonard Rosenthol Tue, 26 May 2009 08:56:57 -0700

Comparing "just text" isn't a good approach, since it doesn't compare any 
other aspect of the page (or the document).   Also, what your approach 
extracts isn't really text; it's just the raw operators from the PDF.  
Without taking the font & encoding information into account, you don't have 
text, and thus aren't even comparing the right thing there...
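
To illustrate the point about font and encoding information, here is a
minimal Java sketch. It does not use iText; the class name EncodingDemo and
both encoding tables are made up for illustration. It only shows that the
same raw string bytes from a content stream decode to different text
depending on the encoding of the font in use, so comparing raw operands
compares byte codes rather than text.

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingDemo {
    // Build a hypothetical font encoding: character code -> Unicode character.
    static Map<Integer, Character> table(int[] codes, String chars) {
        Map<Integer, Character> encoding = new HashMap<Integer, Character>();
        for (int i = 0; i < codes.length; i++) {
            encoding.put(codes[i], chars.charAt(i));
        }
        return encoding;
    }

    // Decode the raw bytes of a string operand with the given encoding.
    // Codes missing from the table become U+FFFD (replacement character).
    static String decode(byte[] raw, Map<Integer, Character> encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            Character c = encoding.get(b & 0xFF);
            sb.append(c != null ? c.charValue() : '\uFFFD');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {1, 2, 3};  // identical operand bytes in both cases
        Map<Integer, Character> fontA = table(new int[] {1, 2, 3}, "PDF");
        Map<Integer, Character> fontB = table(new int[] {1, 2, 3}, "abc");
        System.out.println(decode(raw, fontA));
        System.out.println(decode(raw, fontB));
    }
}
```

The reverse also holds: two pages can show the same text while their raw
operand bytes differ, for example when each file embeds its own font subset.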


And you have a LOT of work ahead of you...

Leonard

-----Original Message-----
From: OscarP [mailto:opasc...@gmail.com] 
Sent: Tuesday, May 26, 2009 11:44 AM
To: itext-questions@lists.sourceforge.net
Subject: Re: [iText-questions] How can compare the content of two revision


Hi Leonard,


OscarP wrote:
> 
> Probably this is not the right way to get the PDF contents, but I see no
> other way to do it, and I don't 
> know what else I can try.
> 

I know this. My code extracts only the text of one page and then compares
this text with another page. If I knew how to do it, I wouldn't ask for
help here.

But you are right, I haven't read the PDF Reference/ISO 32000-1. I am reading
it now.

Thanks anyway


Leonard Rosenthol-3 wrote:
> 
> Clearly, you haven't read the PDF Reference/ISO 32000-1 in order to
> understand PDF and all that it contains if you believe that your presented
> code is, in any way, a valid way to compare documents...
> 
> Leonard
> 
> -----Original Message-----
> From: OscarP [mailto:opasc...@gmail.com] 
> Sent: Tuesday, May 26, 2009 10:30 AM
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] How can compare the content of two revision
> 
> 
> Hi,
> 
> Ok Michael, I was able to get the PDF contents, but my method doesn't work
> for all PDF files. For instance, I can't get it to work with PDF files
> generated with OpenOffice.
> 
> import java.io.*;
> import java.util.*;
> 
> import com.lowagie.text.*;
> import com.lowagie.text.pdf.*;
>  
> public class Example {
>       public static void main(String[] args) {
>               comprobar("d:\\pruebas\\PDFs\\textoF2I.pdf");
>       }
> 
>       public static void comprobar(String fichero) {
>               System.out.println("/////////////////////////////////////");
>               System.out.println(fichero);
>               System.out.println("/////////////////////////////////////");
>               try {
>                       PdfReader reader1 = new PdfReader(fichero);
>                       System.out.println(obtenerPaginaPDF(reader1,1));
>               }catch(Exception e){
>                       e.printStackTrace();
>               }
>       }
>       
>       public static String obtenerPaginaPDF (PdfReader reader,int i){
>               try{
>                       PdfDictionary page = reader.getPageN(i);
>                       byte[] streamBytes = getStreamBytes(page);              
>         
>                       PRTokeniser tokenizer = new PRTokeniser(streamBytes);
>                       StringBuffer sb = new StringBuffer();
>                       boolean arrayAbierto = false;
>                       while (tokenizer.nextToken()) {
>                               if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
>                                       if (tokenizer.getStringValue().equals(" ") && !arrayAbierto)
>                                               sb.append("\n");
>                                       else
>                                               sb.append(tokenizer.getStringValue());
>                               }
>                               else if (tokenizer.getTokenType() == PRTokeniser.TK_START_ARRAY) {
>                                       arrayAbierto=true;
>                               }
>                               else if (tokenizer.getTokenType() == PRTokeniser.TK_END_ARRAY) {
>                                       arrayAbierto=false;
>                                       sb.append("\n");
>                               }
>                       }
>                       return sb.toString();
>               } catch (IOException e) {
>                       // TODO Auto-generated catch block
>                       e.printStackTrace();
>               }
>               return null;
>               
>       }                       
> 
>       private static byte[] getStreamBytes(PdfDictionary page) throws IOException{
>               PdfObject resources = page.get(PdfName.RESOURCES);
> 
>               byte[] streamBytes=null;                        
>               if (resources instanceof PdfDictionary){
>                       try{
>                               PdfDictionary object = (PdfDictionary) ((PdfDictionary)resources).get(PdfName.XOBJECT);
>                               if (object!=null){
>                                       Set set = object.getKeys();
>                                       Iterator it = set.iterator();
>                                       while (it.hasNext()){
>                                               PdfName s = (PdfName) it.next();
>                                               if (object.get(s) instanceof PRIndirectReference){
>                                                       PRIndirectReference objectReference = (PRIndirectReference) object.get(s);
>                                                       PRStream stream = (PRStream) PdfReader.getPdfObject(objectReference);
>                                                       streamBytes = PdfReader.getStreamBytes(stream);
>                                               }
>                                       }
>                               }
>                       }catch(Exception e){
>                               e.printStackTrace();
>                       }
>               }
>               else if (resources instanceof PRIndirectReference){
>                       try{
>                               PdfDictionary object = (PdfDictionary)PdfReader.getPdfObject(resources);
>                               if (object!=null){
>                                       Set set = object.getKeys();
>                                       Iterator it = set.iterator();
>                                       while (it.hasNext()){
>                                               PdfName s = (PdfName) it.next();
>                                               if (object.get(s) instanceof PRIndirectReference){
>                                                       PRIndirectReference objectReference = (PRIndirectReference) object.get(s);
>                                                       PRStream stream = (PRStream) PdfReader.getPdfObject(objectReference);
>                                                       streamBytes = PdfReader.getStreamBytes(stream);
>                                               }
>                                       }
>                               }
>                       }catch(Exception e){
>                       }
>               }
>               if (streamBytes==null){
>                       PdfObject ob = page.get(PdfName.CONTENTS);
>                       if (ob instanceof PRIndirectReference){
>                               PRIndirectReference contents = (PRIndirectReference) page.get(PdfName.CONTENTS);
>                               PRStream streamContents = (PRStream) PdfReader.getPdfObject(contents);
>                               streamBytes = PdfReader.getStreamBytes(streamContents);
>                       }
>                       else if (ob instanceof PdfArray){
>                               for (int j=0;j<((PdfArray)ob).size();j++){
>                                       PRIndirectReference ir = (PRIndirectReference)((PdfArray)ob).getPdfObject(j);
>                                       PRStream streamContents = (PRStream) PdfReader.getPdfObject(ir);
>                                       streamBytes = PdfReader.getStreamBytes(streamContents);
>                               }
>                       }
>               }
>               return streamBytes;
>       }
> }
> 
> Probably this is not the right way to get the PDF contents, but I see no
> other way to do it, and I don't know what else I can try.
> 
> I have run this code with these files:
>  - Generated with Acrobat Professional:
>    http://www.nabble.com/file/p23723941/firmado2vecesOk.pdf
>  - Generated with Ghostscript:
>    http://www.nabble.com/file/p23723941/2274_2007_H_PROVISIONAL.pdf
>  - Generated with MS Word:
>    http://www.nabble.com/file/p23723941/Security%2BArchitecture.pdf
>  - Generated with OpenOffice:
>    http://www.nabble.com/file/p23723941/Prueba-para-Oscar.pdf
> 
> All the examples work "fine" except the OpenOffice one; I haven't tested
> them with embedded images.
> 
> Could you please show me an example on how to do this? Could you at least
> tell me what is going wrong?
> 
> 
> Thank you very much in advance.
> 
> 
> 
> mkl wrote:
>> 
>> Oscar,
>> 
>> 
>> OscarP wrote:
>>> 
>>> OK,
>>> I spent several days working on this, but I cannot find anything. How
>>> can I get those differences? I've analysed the binary of this document,
>>> http://www.nabble.com/file/p23704652/textoF2IMod.pdf , but the object
>>> with the difference (70 0) returns null with iText
>>> (reader.refObj[70]).
>>> 
>> 
>> 70 0 contains a cross-reference stream. iText hides away the
>> cross-reference streams it comes across when collecting cross-reference
>> information by explicitly marking the matching entry in memory as a freed
>> object. ( "if (thisStream < xref.length) xref[thisStream] = -1;" in
>> PdfReader.readXRefStream)
>> 
>> (Actually 70 0 is the cross reference stream holding only the information
>> about object 70 0...)
>> 
>> The rationale for this might be some self protection; usually you never
>> tamper with any former cross-reference tables or streams. When trying to
>> inspect a PDF in detail this is a bit uncomfortable, though.
>> 
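
The effect described above can be sketched with a simplified model. This is
not iText's code: the class XrefDemo, the method names and the
one-entry-per-object long[] layout are assumptions made for illustration.
Only the guarded assignment mirrors the statement quoted from
PdfReader.readXRefStream.

```java
public class XrefDemo {
    // Simplified model: xref[n] is the byte offset of object n,
    // or -1 when the entry is treated as a freed object.
    static void hideXrefStream(long[] xref, int thisStream) {
        if (thisStream < xref.length) xref[thisStream] = -1;
    }

    // A lookup that returns null for freed or out-of-range entries,
    // which is how the hidden object appears to the caller.
    static String lookup(long[] xref, int objNum) {
        if (objNum < 0 || objNum >= xref.length || xref[objNum] < 0) {
            return null;
        }
        return "object at offset " + xref[objNum];
    }

    public static void main(String[] args) {
        long[] xref = {0, 15, 120, 480};    // made-up offsets
        System.out.println(lookup(xref, 3));
        hideXrefStream(xref, 3);            // object 3 plays the xref stream
        System.out.println(lookup(xref, 3));
    }
}
```

In this model the bytes of the object are still in the file; only the
in-memory table entry is gone, so a lookup through the table returns null.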
>> 
>> OscarP wrote:
>>> 
>>> To sum it all up, I need to know whether there are differences between
>>> one signature and the other. I'd be very grateful if you could tell me
>>> the way to get that result with iText.
>>> 
>> 
>> Whether there are differences between the signatures? Do you mean the
>> signature containers or the whole signature dictionaries? Either way,
>> they are directly available from the AcroFields, aren't they?
>> 
>> Regards,   Michael.
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/How-can-compare-the-content-of-two-revision-tp23649348p23723941.html
> Sent from the iText - General mailing list archive at Nabble.com.
> 
> 
> ------------------------------------------------------------------------------
> Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT
> is a gathering of tech-side developers & brand creativity professionals.
> Meet
> the minds behind Google Creative Lab, Visual Complexity, Processing, & 
> iPhoneDevCamp asthey present alongside digital heavyweights like Barbarian
> Group, R/GA, & Big Spaceship. http://www.creativitycat.com 
> _______________________________________________
> iText-questions mailing list
> iText-questions@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> Buy the iText book: http://www.1t3xt.com/docs/book.php
> Check the site with examples before you ask questions:
> http://www.1t3xt.info/examples/
> You can also search the keywords list:
> http://1t3xt.info/tutorials/keywords/
> 

-- 
View this message in context: 
http://www.nabble.com/How-can-compare-the-content-of-two-revision-tp23649348p23725846.html
Sent from the iText - General mailing list archive at Nabble.com.


