Re: [iText-questions] Parsing marked and unmarked content

Mark Storer Wed, 02 Jun 2010 08:48:39 -0700

There are quite a few different "draw this text" operators.  TJ and Tj
are two of... let's see... four.  The other two are:
'
"


(blah) '
is equivalent to 
T* (blah) Tj

1 2 (blah) " 
is equivalent to
1 Tw 2 Tc T* (blah) Tj

Tw sets the word spacing
Tc sets the character spacing
T* advances to the next line based on the current leading (set by TL)
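
To make the two equivalences above concrete, here is a rough sketch in
plain Java.  It is not an iText API: the class name ShowTextExpander and
its expand() method are made up for illustration, and the operands are
assumed to arrive as already-serialized content-stream tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of iText): rewrites the ' and " operators
// into the Tw / Tc / T* / Tj sequences they are shorthand for.
final class ShowTextExpander {

    // operator: the operator token, e.g. "'", "\"" or "Tj"
    // operands: the operand tokens that preceded it, in order
    static List<String> expand(String operator, List<String> operands) {
        List<String> out = new ArrayList<>();
        if (operator.equals("'") && operands.size() == 1) {
            // (blah) '   ->   T* (blah) Tj
            out.add("T*");
            out.add(operands.get(0));
            out.add("Tj");
        } else if (operator.equals("\"") && operands.size() == 3) {
            // 1 2 (blah) "   ->   1 Tw 2 Tc T* (blah) Tj
            out.add(operands.get(0));
            out.add("Tw");
            out.add(operands.get(1));
            out.add("Tc");
            out.add("T*");
            out.add(operands.get(2));
            out.add("Tj");
        } else {
            // Anything else passes through unchanged.
            out.addAll(operands);
            out.add(operator);
        }
        return out;
    }
}
```

Calling expand("\"", Arrays.asList("1", "2", "(blah)")) gives back the
tokens 1 Tw 2 Tc T* (blah) Tj, matching the example above.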

Tj and TJ cover ninety-something percent of the cases.  In fact, I don't
know that I've ever seen a ' or " In The Wild.  Nonetheless, if you
want to be thorough, include them.
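
If you do want to be thorough, one option is to swap the two-way string
comparison for a set lookup.  This is only a sketch: TextOperators and
isShowText() are hypothetical names, and it assumes your tokenizer hands
back ' and " as operator tokens whose string value is just the quote
character, which I have not verified for PRTokeniser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of iText): recognizes all four
// text-showing operators instead of only Tj and TJ.
final class TextOperators {

    private static final Set<String> SHOW_TEXT =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));

    // True if the operator token draws text. Operator names are
    // case-sensitive, so "tj" does not match.
    static boolean isShowText(String operator) {
        return SHOW_TEXT.contains(operator);
    }
}
```

The first two operands of " are numbers rather than strings, so a loop
that only concatenates string tokens should still pick up just the text.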

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 

> -----Original Message-----
> From: sal salaimani [mailto:[email protected]]
> Sent: Tuesday, June 01, 2010 3:41 PM
> To: [email protected]
> Subject: [iText-questions] Parsing marked and unmarked content
> 
> 
> I am parsing marked and unmarked content using the PdfContentParser and
> PRTokeniser classes of the iText API.  Here are my algorithms.
> 
> logic 1
>    Getting marked content
>       1. Look for the dictionary starting point.
>       2. If the next token is "MCID", loop through until I find the "EMC"
>          operator.
>       3. Inside the loop, keep concatenating strings until I hit a "TJ" or
>          "Tj" operator, then store the result in an ArrayList.
> logic 2
>    Getting all text
>       1. Loop through until the end of the file.
>       2. Inside the loop, keep concatenating strings until I hit a "TJ" or
>          "Tj" operator, then store the result in an ArrayList.
> logic 3
>    Getting unmarked content
>       1. Find the difference between the logic 1 and logic 2 ArrayLists and
>          store the result.
> 
> These algorithms work for me. Can anyone tell me whether they will
> work for all test cases?
> 
> I am also attaching code snippets for the algorithms.
> 
> logic 1
> while (tokenizer.nextToken()) {
>     if (tokenizer.getTokenType() == PRTokeniser.TK_NAME
>             && tokenizer.getStringValue().equals("Artifact")) {
>         skip_artifact_flag = true;
>         continue;
>     }
>     if (tokenizer.getTokenType() == PRTokeniser.TK_START_DIC) {
>         tokenizer.nextToken();
>         if (tokenizer.getStringValue().equals("MCID")) {
>             skip_artifact_flag = false;
>             tokenizer.nextToken();
>             mcid_i = tokenizer.intValue();
> 
>             //need to have loop until EMC or
>             while (tokenizer.nextToken()) {
>                 if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
>                         && tokenizer.getStringValue().equals("EMC")) {
>                     mcid_i = -1;
>                     break;
>                 }
>                 if (tokenizer.getTokenType() == PRTokeniser.TK_STRING
>                         && skip_artifact_flag == false)
>                     value = value + tokenizer.getStringValue();
> 
>                 if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
>                         && (tokenizer.getStringValue().equals("TJ")
>                             || tokenizer.getStringValue().equals("Tj"))) {
>                     if (!value.trim().equals("")) {
>                         //mcidMap.put(new Integer(mcid_i).toString(),value);
>                         TxtcontentMarked.add(value);
>                     }
>                     value = "";
>                 }
>             }
>         }
> 
> logic 2
> 
> while (tokenizer.nextToken()) {
>     // if()
>     if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
>             && (tokenizer.getStringValue().equals("TJ")
>                 || tokenizer.getStringValue().equals("Tj"))) {
>         if (!value.trim().equals(""))
>             Txtcontent.add(value);
>         value = "";
>         //break;
>     }
>     if (tokenizer.getTokenType() == PRTokeniser.TK_STRING)
>         value = value + tokenizer.getStringValue();
>     // System.out.println("va ="+ value);
> }
> 
> logic 3
> 
> // Iterator iterator = mcidMap.keySet().iterator();
> int arrayListSize = Txtcontent.size();
> // TxtNotMarked = Txtcontent;
> int arrayListSize0 = TxtcontentMarked.size();
> 
> for (int k = 0; k < arrayListSize0; k++) {
>     //{
>     // while (iterator.hasNext()) {
>     //   String key = iterator.next().toString();
>     //   String value_h = mcidMap.get(key).toString();
>     //   System.out.println("[ "+key + " ] " + "[[[------]]]" + value_h);
> 
>     for (int i = 0; i < arrayListSize; i++) {
>         //System.out.println("Content  = "+Txtcontent.get(i));
>         if (TxtcontentMarked.get(k).trim().equals(Txtcontent.get(i).trim())) {
>             //TxtNotMarked.add(Txtcontent.get(i));
>             //TxtNotMarked.remove(i);
>             Txtcontent.remove(i);
>             arrayListSize = Txtcontent.size();
>         }
>     }
> }
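
One thing to watch in the logic 3 loop: removing by index while i keeps
incrementing can skip the element that follows a removal, and the loop
drops every matching string rather than one per marked entry.  A possible
alternative, offered only as a sketch (UnmarkedText and difference() are
hypothetical names, and it assumes comparing trimmed strings is what you
want), removes one occurrence per marked string:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of iText): computes "all text minus marked
// text", removing one occurrence from the full list per marked string.
final class UnmarkedText {

    static List<String> difference(List<String> all, List<String> marked) {
        // Work on a trimmed copy so the caller's list is left untouched.
        List<String> remaining = new ArrayList<>();
        for (String s : all) {
            remaining.add(s.trim());
        }
        for (String m : marked) {
            // List.remove(Object) removes only the first equal element,
            // and does nothing if there is no match.
            remaining.remove(m.trim());
        }
        return remaining;
    }
}
```

With all = [a, b, a, c] and marked = [a, c] this leaves [b, a], so text
that appears once marked and once unmarked is not lost.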
> 
> Sal Salaimani
> --
> View this message in context:
> http://itext-general.2136553.n4.nabble.com/Parsing-marked-and-unmarked-content-tp2239347p2239347.html
> Sent from the iText - General mailing list archive at Nabble.com.
> 
> _______________________________________________
> iText-questions mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> Buy the iText book: http://www.itextpdf.com/book/
> Check the site with examples before you ask questions:
> http://www.1t3xt.info/examples/
> You can also search the keywords list:
> http://1t3xt.info/tutorials/keywords/
