>Does anyone have any idea how to associate content with tags especially 
>when the two trees are not in sync (see attached file for an example).

Okay... here's the beginning of the content for your sample page:
--------
BT
/P <</Attached [/Bottom ]/BBox [72 36.8699 215.9015 49.4529 ]/MCID 73 /Subtype
/Footer /Type /Pagination >>BDC 
/CS0 cs 0  scn
/TT0 1 Tf
0.001 Tc -0.003 Tw 10.98 0 0 10.98 72 39 Tm
[(C)-1(o)-2(p)2(yri)1(g)2(h)2(t)3( \251)-6( )-5(2009 )]TJ
-0.001 Tc -0.001 Tw [(Of)-2(f)-2(i)-1(c)1(e)-6( S)-6(u)-1(p)-1(pl)-2(y)]TJ
0 Tc 0 Tw 12.88 0 Td
( )Tj
EMC 
/Artifact <<>>BDC 
28.24 56.563 Td
( )Tj
EMC 
/H1 <</MCID 1 >>BDC 
/CS1 cs 0.212 0.373 0.569  scn
/TT1 1 Tf
-0.001 Tc 0.001 Tw 18 0 0 18 72 616.98 Tm
[(N)-2(ew)-3(s)-2(let)-3(t)-3(er)]TJ
0 Tc 0 Tw ( )Tj
EMC 
--------

The part we're most interested in is the /P <<...>>BDC line.  That's a direct 
content specifier.  Inside the BDC dictionary you'll find an MCID (73 in this 
case).  That's the marked content INDEX into its container's array in the 
ParentTree.  The parent container is a page, and that page's StructParents 
value is 0.  So you look up key 0 in the ParentTree (which is not a standard 
PdfArray, it's a "Number Tree"), and grab the structure element at index 73 
of the array stored there.
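
The lookup described above could be sketched roughly like this.  It is plain 
Python, not iText: PDF dictionaries are modeled as dicts and PDF arrays as 
lists, and the function names and the sample data are made up for 
illustration.  It assumes a well-formed number tree (/Nums on leaves, /Kids 
with /Limits on intermediate nodes).

```python
# Hypothetical in-memory model: PDF dictionaries as dicts, PDF arrays as lists.

def number_tree_lookup(node, key):
    """Return the value stored under integer `key` in a number tree, or None."""
    nums = node.get("Nums")
    if nums is not None:
        # /Nums is a flat array: [key0, value0, key1, value1, ...]
        for i in range(0, len(nums), 2):
            if nums[i] == key:
                return nums[i + 1]
        return None
    # Intermediate node: descend into the kid whose /Limits covers the key.
    for kid in node.get("Kids", []):
        low, high = kid["Limits"]
        if low <= key <= high:
            return number_tree_lookup(kid, key)
    return None


def struct_elem_for_mcid(parent_tree, struct_parents, mcid):
    """Map (page's /StructParents value, MCID) to a structure element."""
    per_page = number_tree_lookup(parent_tree, struct_parents)
    if per_page is None or mcid >= len(per_page):
        return None
    return per_page[mcid]  # can be None when the array holds null entries


# Made-up stand-ins for the structure elements in the sample page.
footer = {"S": "P"}
heading = {"S": "H1"}
page0 = [None] * 74          # the page's array, indexed by MCID
page0[1] = heading
page0[73] = footer
parent_tree = {"Nums": [0, page0]}

print(struct_elem_for_mcid(parent_tree, 0, 73))
```

A linear scan of /Nums is enough for a sketch; the keys are sorted, so a 
real implementation could binary-search instead.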

We've now left the ParentTree and are in the logical structure.  That element's 
/P(arent) key points to its logical container (a /Sect(ion) representing the 
entire page).  If you enumerate the parent's /K(ids) array (which is in logical 
order), looking for the element you found at index 73, you'll find out where 
in the logical order it belongs.  It happens to be the last element in logical 
order.  Perfectly legal, if a bit strange.
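
Finding an element's place in the logical order might look something like 
the following.  Again this is a plain-Python sketch rather than iText code: 
the dict layout, the function name and the four-element section are 
assumptions for illustration, and object identity stands in for "same 
indirect reference".

```python
def logical_position(elem):
    """Return (index, sibling_count) of `elem` within its parent's /K array."""
    kids = elem["P"]["K"]
    if not isinstance(kids, list):
        kids = [kids]            # /K may hold a single kid instead of an array
    for i, kid in enumerate(kids):
        if kid is elem:          # identity stands in for the same indirect ref
            return i, len(kids)
    raise ValueError("element is not listed in its parent's /K array")


# Made-up section with four kids; only the ordering matters here.
sect = {"S": "Sect", "K": []}
intro = {"S": "P", "P": sect}
body = {"S": "P", "P": sect}
heading = {"S": "H1", "P": sect}
footer = {"S": "P", "P": sect}
sect["K"] = [intro, body, heading, footer]

index, count = logical_position(footer)
print(index == count - 1)        # True when the element is last in logical order
```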

The content covered by this BDC ends at the EMC marker.  

The next BDC is tagged /Artifact and its dictionary is empty.  You can safely 
ignore BDCs without an MCID.  They aren't part of the document's logical 
structure.  This particular artifact is just some empty space (presumably for 
text positioning).

The next BDC dictionary only specifies an MCID, which is fine.  Using the same 
lookup strategy, we can determine that this particular section of marked 
content is the third element in the page's logical order.
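
To pull the tag and MCID out of each BDC in a content stream like the one 
above, a quick-and-dirty scan could look like this.  It is only a sketch: 
it uses regular expressions instead of a real content-stream tokenizer, and 
it assumes the property list is an inline dictionary with no nested 
<<...>> (true for this sample, not guaranteed in general).  The function 
name is made up, and the stream is a trimmed copy of the sample.

```python
import re

# "/Tag <<...>>BDC" with an inline, non-nested property dictionary.
BDC_RE = re.compile(rb"/(\w+)\s*<<(.*?)>>\s*BDC", re.S)
MCID_RE = re.compile(rb"/MCID\s+(\d+)")


def marked_content_ids(stream):
    """Return [(tag, mcid_or_None), ...] for each BDC, in stream order."""
    results = []
    for match in BDC_RE.finditer(stream):
        tag = match.group(1).decode("ascii")
        mcid = MCID_RE.search(match.group(2))
        results.append((tag, int(mcid.group(1)) if mcid else None))
    return results


# Trimmed copy of the sample page's content stream.
sample = rb"""BT
/P <</Attached [/Bottom ]/BBox [72 36.8699 215.9015 49.4529 ]/MCID 73 /Subtype
/Footer /Type /Pagination >>BDC
( )Tj
EMC
/Artifact <<>>BDC
( )Tj
EMC
/H1 <</MCID 1 >>BDC
( )Tj
EMC
"""

for tag, mcid in marked_content_ids(sample):
    if mcid is None:
        continue                 # no MCID: not part of the logical structure
    print(tag, mcid)
```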


If you examine this PDF with an object-level viewer (or a text editor and a
bunch of scrolling back and forth) you'll find out some Interesting Things:
1) The /K entry for the leaf structure elements is their MCID.  Just like the 
spec says... learn something new every day.
2) There are a bunch of null entries in the ParentTree's first array.  I 
suspect more ID/Idx confusion.
3) Several elements in the MCID array for the page are just wrappers for 
entries from the top level ParentTree... which doesn't strike me as kosher.  
OTOH, I'm /still/ learning about marked content too (see #1 above).
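
Going the other way (structure tree to content), point #1 means you can 
collect MCIDs in logical order by walking the /K entries.  Here's a rough 
plain-Python sketch.  The dict layout, the function name and the sample 
MCIDs other than 1 and 73 are made up, and it assumes every MCID belongs to 
the same page, which a real document won't guarantee.

```python
def mcids_in_logical_order(elem):
    """Yield the MCIDs found under `elem`, in logical (structure tree) order."""
    kids = elem.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]                    # /K may be a single item
    for kid in kids:
        if kid is None:
            continue                     # tolerate null entries
        if isinstance(kid, int):
            yield kid                    # leaf: /K holds the MCID directly
        elif kid.get("Type") == "MCR":
            yield kid["MCID"]            # marked-content reference dictionary
        elif kid.get("Type") == "OBJR":
            continue                     # object reference, carries no MCID
        else:
            yield from mcids_in_logical_order(kid)   # nested structure element


# Made-up section: the footer (MCID 73) comes last, the heading (MCID 1) third.
sect = {"S": "Sect", "K": [
    {"S": "P", "K": 5},
    {"S": "P", "K": [6, {"Type": "MCR", "MCID": 7}]},
    {"S": "H1", "K": 1},
    {"S": "P", "K": 73},
]}

print(list(mcids_in_logical_order(sect)))
```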


--Mark

