[ 
https://issues.apache.org/jira/browse/PDFBOX-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021799#comment-14021799
 ] 

Andreas Lehmkühler commented on PDFBOX-2116:
--------------------------------------------

IMHO your approach is a little bit naive, sorry, no offense intended. The PDF 
format isn't a text-based format, so you can't find the differences by just 
comparing each single line. Additionally, every PDF generator uses its own 
strategy to create PDFs, so most likely you won't find two PDFs in the wild 
which are optically *and* structurally similar/identical.
If you really want to implement your own tool, you have to think about the goal 
again. Do you want to compare the files themselves (differences within the 
structure of the PDF) or the content of the PDFs? IMHO the latter is the more 
obvious one. You can start with comparing the extracted text, but I guess you 
have something more in mind, like the fonts used, text size, colour, 
formatting etc. That's a lot of work to do.
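
As a starting point for comparing the extracted text, here is a minimal sketch of a word-level diff based on a longest common subsequence. It assumes the page text has already been extracted (for example with PDFTextStripper.getText) and is passed in as plain strings. The class name WordDiff and its methods are made up for illustration and are not part of PDFBox. Unlike a strict line-by-line and column-by-column comparison, it still produces a result when the two texts have different line or word counts.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: word-level diff of two already-extracted page texts. */
final class WordDiff {

    /** Splits text on whitespace; an empty or blank text yields no words. */
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Returns an edit script: "  w" = common word, "- w" = only in a, "+ w" = only in b. */
    static List<String> diff(String a, String b) {
        String[] x = words(a);
        String[] y = words(b);

        // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
        int[][] lcs = new int[x.length + 1][y.length + 1];
        for (int i = x.length - 1; i >= 0; i--) {
            for (int j = y.length - 1; j >= 0; j--) {
                lcs[i][j] = x[i].equals(y[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both word lists, following the table to emit the edit script.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < x.length && j < y.length) {
            if (x[i].equals(y[j])) {
                out.add("  " + x[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + x[i++]);
            } else {
                out.add("+ " + y[j++]);
            }
        }
        while (i < x.length) {
            out.add("- " + x[i++]);
        }
        while (j < y.length) {
            out.add("+ " + y[j++]);
        }
        return out;
    }
}
```

Calling WordDiff.diff("Total amount 100 EUR", "Total amount due 120 EUR") marks "100" as removed and "due" and "120" as added, while differences in line breaks alone produce no mismatch. This only covers the text itself; fonts, sizes and colours would need additional information from the text positions.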

I suggest starting with an already existing tool (e.g. 
[diff-pdf|https://github.com/vslavik/diff-pdf]); that's maybe exactly what 
you're looking for.
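
If such a tool should be driven from Java, a rough sketch could look like the following. It assumes the diff-pdf executable is installed and on the PATH, and it relies on the behaviour described in the diff-pdf README: the --output-diff option writes a PDF with the differences highlighted, and the exit code is 0 for identical files and 1 for differing ones. The class DiffPdfRunner is hypothetical, and treating every other exit code as a failure is an assumption.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the external diff-pdf command line tool. */
final class DiffPdfRunner {

    /** Builds the command line; "diff-pdf" is assumed to be on the PATH. */
    static List<String> command(String first, String second, String diffOutput) {
        return Arrays.asList("diff-pdf", "--output-diff=" + diffOutput, first, second);
    }

    /** Runs diff-pdf and returns true if it reports differences (exit code 1). */
    static boolean differs(String first, String second, String diffOutput)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(first, second, diffOutput))
                .inheritIO() // pass the tool's own messages through to the console
                .start();
        int exit = process.waitFor();
        // Assumption: anything other than 0 or 1 means the tool itself failed.
        if (exit != 0 && exit != 1) {
            throw new IOException("diff-pdf failed with exit code " + exit);
        }
        return exit == 1;
    }
}
```

A call such as DiffPdfRunner.differs("a.pdf", "b.pdf", "diff.pdf") would then leave the highlighted result in diff.pdf, which is close to what the issue title asks for.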

> Compare tow pdf file and hilight the mismatch value in generated pdf file 
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-2116
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2116
>             Project: PDFBox
>          Issue Type: Task
>          Components: PDModel
>    Affects Versions: 1.8.5
>         Environment: Java Environment using PDF box
>            Reporter: Amit Vishwakarma
>              Labels: test
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code}
> PDDocument doc = PDDocument.load(pdf1);
> PDDocument doc2 = PDDocument.load(pdf2);
>
> System.out.println(doc);
>
> @SuppressWarnings("rawtypes")
> List list = doc.getDocumentCatalog().getAllPages();
> @SuppressWarnings("rawtypes")
> List list2 = doc2.getDocumentCatalog().getAllPages();
>
> PDFTextStripper stripper = new PDFTextStripper();
> PDFTextStripper stripper2 = new PDFTextStripper();
>
> String pages = null;
> String pages2 = null;
>
> System.out.println("list1 size : " + list.size());
> System.out.println("list2 size : " + list2.size());
>
> if (list.size() == list2.size()) {
>
>     // Page numbers of PDFTextStripper are 1-based.
>     for (int i = 1; i <= list.size(); i++) {
>         stripper.setStartPage(i);
>         stripper.setEndPage(i);
>
>         stripper2.setStartPage(i);
>         stripper2.setEndPage(i);
>
>         // System.out.println("-----------" + stripper.getEndPage());
>
>         pages = stripper.getText(doc);
>         pages2 = stripper2.getText(doc2);
>
>         String lines[] = pages.split("\\r?\\n");
>         String lines2[] = pages2.split("\\r?\\n");
>
>         System.out.println("Line in first page : " + lines.length);
>         System.out.println("Line in second page : " + lines2.length);
>
>         if (lines.length == lines2.length) {
>
>             for (int a = 0; a < lines.length; a++) {
>                 // System.out.println(lines[a]);
>                 // System.out.println("************----------**********");
>                 String cols[] = lines[a].split("\\s+");
>                 String cols2[] = lines2[a].split("\\s+");
>                 if (cols.length == cols2.length) {
>                     for (int b = 0; b < cols.length; b++) {
>                         // System.out.println(cols[b].toString() + " - - - - " + cols2[b].toString());
>                         // System.out.println("Page : " + i + " Row : " + a + " Column : " + b);
>                         if (!cols[b].toString().equalsIgnoreCase(cols2[b].toString())) {
>                             System.out.println("Not matched : " + cols2[b].toString());
>                             // System.out.println("Page : " + i + " Row : " + a + " Column : " + b);
>                         }
>                     }
>                 } else {
>                     System.out.println("column are not equals");
>                 }
>             }
>             System.out.println("******");
>         } else {
>             System.out.println("Line are not equal ");
>         }
>
>     }
> } else {
>     System.out.println("Page size is not equal");
> }
>
> // Close both documents, not only the first one.
> doc.close();
> doc2.close();
> {code}



