I am parsing marked and unmarked content using the PdfContentParser and
PRTokeniser classes of the iText API. Here are my algorithms.
  
logic 1
   Getting marked content
      1. Look for the start of a dictionary.
      2. If the next token is "MCID", loop until I find the "EMC" operator.
      3. Inside the loop, keep concatenating strings until I hit a "TJ" or
         "Tj" operator, then store the result in an ArrayList.
logic 2
   Getting all text
      1. Loop until the end of the file.
      2. Inside the loop, keep concatenating strings until I hit a "TJ" or
         "Tj" operator, then store the result in an ArrayList.
logic 3
   Getting unmarked content
      1. Take the difference between the logic 2 and logic 1 ArrayLists and
         store the result.

These algorithms work for me. Can anyone tell me whether they will work for
all test cases?

I am also attaching code snippets for the algorithms.

logic 1
// Collect the text shown inside marked-content sequences that carry an MCID.
while (tokenizer.nextToken()) {
    // An /Artifact tag means the following content should be skipped.
    if (tokenizer.getTokenType() == PRTokeniser.TK_NAME
            && tokenizer.getStringValue().equals("Artifact")) {
        skip_artifact_flag = true;
        continue;
    }
    if (tokenizer.getTokenType() == PRTokeniser.TK_START_DIC) {
        tokenizer.nextToken();
        if (tokenizer.getStringValue().equals("MCID")) {
            skip_artifact_flag = false;
            tokenizer.nextToken();
            mcid_i = tokenizer.intValue();

            // Read tokens until the EMC operator closes the sequence.
            while (tokenizer.nextToken()) {
                if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
                        && tokenizer.getStringValue().equals("EMC")) {
                    mcid_i = -1;
                    break;
                }
                // Accumulate string operands of the current text operator.
                if (tokenizer.getTokenType() == PRTokeniser.TK_STRING
                        && !skip_artifact_flag) {
                    value = value + tokenizer.getStringValue();
                }
                // A text-showing operator ends the current chunk.
                if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
                        && (tokenizer.getStringValue().equals("TJ")
                            || tokenizer.getStringValue().equals("Tj"))) {
                    if (!value.trim().equals("")) {
                        //mcidMap.put(new Integer(mcid_i).toString(), value);
                        TxtcontentMarked.add(value);
                    }
                    value = "";
                }
            }
        }
    }
}

logic 2

// Collect every text chunk in the stream, marked or not.
while (tokenizer.nextToken()) {
    // A text-showing operator ends the current chunk.
    if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
            && (tokenizer.getStringValue().equals("TJ")
                || tokenizer.getStringValue().equals("Tj"))) {
        if (!value.trim().equals("")) {
            Txtcontent.add(value);
        }
        value = "";
    }
    // Accumulate string operands of the current text operator.
    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
        value = value + tokenizer.getStringValue();
    }
    // System.out.println("va =" + value);
}
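
The chunking rule shared by logic 1 and logic 2 (concatenate strings, flush on "Tj" or "TJ") can be tried without a PDF at hand. The following is only a sketch under assumptions: `TextChunker`, `Kind` and `Token` are hypothetical stand-ins for the tokenizer output and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

public class TextChunker {
    // Hypothetical token kinds; they only mimic what a tokenizer reports.
    public enum Kind { STRING, OPERATOR, OTHER }

    // Hypothetical token: a kind plus its string value.
    public static final class Token {
        final Kind kind;
        final String value;

        public Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    // Concatenates string tokens and emits one chunk per Tj/TJ operator.
    // Chunks that are empty or whitespace-only are dropped.
    public static List<String> chunks(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Token t : tokens) {
            if (t.kind == Kind.STRING) {
                current.append(t.value);
            } else if (t.kind == Kind.OPERATOR
                    && (t.value.equals("Tj") || t.value.equals("TJ"))) {
                String chunk = current.toString();
                if (!chunk.trim().isEmpty()) {
                    out.add(chunk);
                }
                current.setLength(0);
            }
        }
        return out;
    }
}
```

Tokens that are neither strings nor text-showing operators (for example the kerning numbers inside a TJ array) are simply ignored, which matches what the loops above do.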

logic 3

// Remove each marked chunk from the full list; what remains in Txtcontent
// is the unmarked text.
for (int k = 0; k < TxtcontentMarked.size(); k++) {
    for (int i = 0; i < Txtcontent.size(); i++) {
        if (TxtcontentMarked.get(k).trim().equals(Txtcontent.get(i).trim())) {
            Txtcontent.remove(i);
            // Stop after the first match. Continuing with i++ after a
            // remove would skip the element that shifted into position i,
            // and removing every match would also drop unmarked text that
            // happens to equal a marked chunk.
            break;
        }
    }
}
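
The difference step in logic 3 can be checked in isolation. This is a minimal sketch, assuming each marked chunk should cancel exactly one matching chunk in the full list; the names `UnmarkedDiff` and `subtract` are hypothetical and not part of iText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnmarkedDiff {
    // Returns a copy of `all` with one occurrence removed for each entry
    // of `marked`. Comparison ignores leading and trailing whitespace.
    public static List<String> subtract(List<String> all, List<String> marked) {
        List<String> remaining = new ArrayList<>(all);
        for (String m : marked) {
            String key = m.trim();
            for (int i = 0; i < remaining.size(); i++) {
                if (remaining.get(i).trim().equals(key)) {
                    remaining.remove(i); // remove by index, first match only
                    break;
                }
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("Title", "Page 1", "Body", "Page 1");
        List<String> marked = Arrays.asList("Title", "Body", "Page 1");
        System.out.println(subtract(all, marked));
    }
}
```

Working on a copy leaves the original list untouched, so the full text is still available after the unmarked part has been computed.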

Sal Salaimani