I am parsing marked and unmarked content using the PdfContentParser and
PRTokeniser classes of the iText API. Here are my algorithms.
logic 1
Getting Marked Content
1. Look for the start of a dictionary.
2. If the next token is "MCID", loop through the tokens until I find the
"EMC" operator.
3. Inside the loop, keep concatenating strings until I hit a "TJ" or
"Tj" operator, and store the result in an ArrayList.
logic 2
Getting All text
1. Loop through the tokens until the end of the content stream.
2. Inside the loop, keep concatenating strings until I hit a "TJ" or
"Tj" operator, and store the result in an ArrayList.
logic 3
Getting Unmarked content
1. Compute the difference between the ArrayLists from logic 2 and logic 1
(all text minus marked text) and store the result.
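A minimal sketch of the difference step, assuming plain lists of strings as input. The class and method names (UnmarkedDiffSketch, unmarked) are hypothetical and not part of iText. Note that this variant treats the lists as multisets, so each marked string cancels only one equal entry in the full list; that is an assumption about the desired behavior, not something the steps above specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of iText.
final class UnmarkedDiffSketch {
    // Returns the entries of allText that are not accounted for by markedText.
    // Each marked entry cancels at most one equal (trimmed) entry of allText,
    // so a string that occurs both marked and unmarked is not lost entirely.
    static List<String> unmarked(List<String> allText, List<String> markedText) {
        // Count how many times each trimmed marked string occurs.
        Map<String, Integer> remaining = new HashMap<>();
        for (String m : markedText) {
            remaining.merge(m.trim(), 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (String s : allText) {
            String key = s.trim();
            Integer n = remaining.get(key);
            if (n != null && n > 0) {
                remaining.put(key, n - 1); // consumed by one marked occurrence
            } else {
                result.add(s);             // no marked counterpart left
            }
        }
        return result;
    }
}
```

With all text "Header", "Hello", "Page 1", "Hello" and marked text "Hello", this keeps "Header", "Page 1" and the second "Hello". A set-style difference would drop both "Hello" entries, which may or may not be what is wanted when the same string appears inside and outside marked content.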
These algorithms work for me. Can anyone tell me whether they will work
for all test cases?
I am also including the code snippets for the algorithms.
logic 1
while (tokenizer.nextToken()) {
    // An /Artifact name marks content that should be skipped.
    if (tokenizer.getTokenType() == PRTokeniser.TK_NAME
            && tokenizer.getStringValue().equals("Artifact")) {
        skip_artifact_flag = true;
        continue;
    }
    if (tokenizer.getTokenType() == PRTokeniser.TK_START_DIC) {
        tokenizer.nextToken();
        if (tokenizer.getStringValue().equals("MCID")) {
            skip_artifact_flag = false;
            tokenizer.nextToken();
            mcid_i = tokenizer.intValue();
            //need to have loop until EMC or
            while (tokenizer.nextToken()) {
                if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
                        && tokenizer.getStringValue().equals("EMC")) {
                    mcid_i = -1;
                    break;
                }
                // Accumulate string operands of the current text operator.
                if (tokenizer.getTokenType() == PRTokeniser.TK_STRING
                        && skip_artifact_flag == false)
                    value = value + tokenizer.getStringValue();
                // A text-showing operator closes the current string.
                if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
                        && (tokenizer.getStringValue().equals("TJ")
                            || tokenizer.getStringValue().equals("Tj"))) {
                    if (!value.trim().equals("")) {
                        //mcidMap.put(new Integer(mcid_i).toString(),value);
                        TxtcontentMarked.add(value);
                    }
                    value = "";
                }
            }
        }
    }
}
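A minimal sketch of the logic 1 loop that can be run without a PDF, assuming a simplified token model. Tok, Tok.Kind, MarkedTextSketch and markedText are hypothetical names that stand in for the PRTokeniser token stream; they are not iText classes, and the sketch ignores /Artifact handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified token; real code would read tokens from PRTokeniser.
final class Tok {
    enum Kind { NAME, START_DIC, END_DIC, NUMBER, STRING, OTHER }
    final Kind kind;
    final String text;
    Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
}

final class MarkedTextSketch {
    // Collects one string per Tj/TJ operator found between a dictionary
    // whose first key is MCID and the following EMC operator.
    static List<String> markedText(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inMarked = false;
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (!inMarked) {
                // Steps 1 and 2: dictionary start followed by the MCID name.
                if (t.kind == Tok.Kind.START_DIC && i + 1 < tokens.size()
                        && tokens.get(i + 1).kind == Tok.Kind.NAME
                        && tokens.get(i + 1).text.equals("MCID")) {
                    inMarked = true;
                    current.setLength(0);
                }
                continue;
            }
            if (t.kind == Tok.Kind.OTHER && t.text.equals("EMC")) {
                inMarked = false; // end of the marked-content sequence
            } else if (t.kind == Tok.Kind.STRING) {
                current.append(t.text); // step 3: concatenate string operands
            } else if (t.kind == Tok.Kind.OTHER
                    && (t.text.equals("Tj") || t.text.equals("TJ"))) {
                if (current.toString().trim().length() > 0) {
                    out.add(current.toString());
                }
                current.setLength(0);
            }
        }
        return out;
    }
}
```

Feeding it the tokens for "(Hello) Tj /P << /MCID 0 >> BDC [(Mar) -20 (ked)] TJ EMC (Tail) Tj" should collect only "Marked", since "Hello" and "Tail" fall outside the marked-content sequence. A small fixture like this makes it easier to try edge cases, such as nested marked content or a dictionary where MCID is not the first key, before running against real files.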
logic 2
while (tokenizer.nextToken()) {
    if (tokenizer.getTokenType() == PRTokeniser.TK_OTHER
            && (tokenizer.getStringValue().equals("TJ")
                || tokenizer.getStringValue().equals("Tj"))) {
        if (!value.trim().equals(""))
            Txtcontent.add(value);
        value = "";
    }
    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING)
        value = value + tokenizer.getStringValue();
    // System.out.println("va =" + value);
}
logic 3
int arrayListSize0 = TxtcontentMarked.size();
for (int k = 0; k < arrayListSize0; k++) {
    // Walk backwards so that remove(i) does not shift entries
    // that have not been visited yet.
    for (int i = Txtcontent.size() - 1; i >= 0; i--) {
        if (TxtcontentMarked.get(k).trim().equals(Txtcontent.get(i).trim())) {
            Txtcontent.remove(i);
        }
    }
}
// Txtcontent now holds only the unmarked text.
Sal Salaimani