The "null" is a bug in the sense that the PDF has an empty ToUnicode map,
whereas iText expected a valid one. It's fixed in SVN.
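
To illustrate how a missing mapping can end up as the literal word "null",
here is a minimal sketch. It is not the iText code and not the fix that went
into SVN. The class and method names are hypothetical, the plain Map stands in
for a ToUnicode map, and using a space as the fallback is only an assumption
about what <0001> represents.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeFallbackSketch {

    // Naive decoding: appends whatever the map returns for each code.
    // StringBuilder.append((String) null) appends the four characters "null".
    public static String decodeNaive(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.get(code));
        }
        return sb.toString();
    }

    // Defensive decoding: substitutes a caller-chosen fallback when a code
    // has no entry in the map (the fallback value is an assumption).
    public static String decodeWithFallback(int[] codes,
                                            Map<Integer, String> toUnicode,
                                            String fallback) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : fallback);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> emptyMap = new HashMap<>(); // stands in for the empty ToUnicode map
        int[] separator = {0x0001};                      // the <0001> string from the page

        // prints "Implications nullof"
        System.out.println("Implications " + decodeNaive(separator, emptyMap) + "of");
        // prints "Implications of"
        System.out.println("Implications" + decodeWithFallback(separator, emptyMap, " ") + "of");
    }
}
```

The only difference between the two methods is the null check before the
append, which is why an empty map shows up as visible text and not as an
exception.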
Paulo
----- Original Message -----
From: 1T3XT BVBA
To: [email protected]
Sent: Saturday, February 18, 2012 3:19 PM
Subject: Re: [iText-questions] iText 5.1.4 text extraction issue
On 18/02/2012 15:25, sselvia wrote:
> I sent a request to get pricing for commercial support. I brought up the PDF
> in Preview and saved the page in question to a separate PDF. When I process
> the file after the "Save As" the word "null" is returned by the getText()
> method.
> http://itext-general.2136553.n4.nabble.com/file/n4399827/iTextTest.pdf
>
> Thanks for all of the help to this point.
I didn't see the question on the internal support list yet.
Am I correct that you posted the question twice, once on the free list,
once on the customer's list?
Because I can't see it on the customer's support list.
Anyway, this is the syntax inside your PDF:
q Q q 0 0.03 612 791.97 re W n /Gs1 gs /Cs1 cs 1 1 1 sc /Gs2 gs 0
0.029999 612 791.97
re f Q q 0 594.42 612 197.58 re W n /Gs2 gs q 619.2599 0 0 198.48
-3.600012 594.42
cm /Im1 Do Q Q q 0 395.94 612 198.48 re W n /Gs2 gs q 619.2599 0 0
198.48 -3.600012 395.94
cm /Im2 Do Q Q q 0 197.46 612 198.48 re W n /Gs2 gs q 619.2599 0 0
198.48 -3.600012 197.46
cm /Im3 Do Q Q q 0 0.03 612 197.43 re W n /Gs2 gs q 619.2599 0 0 198.42
-3.600012 -0.959936
cm /Im4 Do Q Q q 0 0 612 792 re W n 0.754 sc /Gs1 gs /Gs2 gs BT -0.0004 Tc
28.02 0 0 28.02 69.12 742.98 Tm /TT1.1 1 Tf (!"#$%&'\(%\)*+) Tj 0 Tc ET BT
28.02 0 0 28.02 211.5008 742.98 Tm /G2 1 Tf <0001> Tj ET BT -0.0001 Tc
28.02 0 0 28.02 217.8614 742.98
Tm /TT1.1 1 Tf (\),) Tj 0 Tc ET BT 28.02 0 0 28.02 241.74 742.98 Tm /G2 1
Tf <0001> Tj ET BT 0.0005 Tc 28.02 0 0 28.02 248.1006 742.98 Tm /TT1.1 1 Tf
(-.&.*\() Tj 0 Tc ET BT 28.02 0 0 28.02 328.6216 742.98 Tm /G2 1 Tf <0001>
Tj ET BT -0.0011 Tc 28.02 0 0 28.02 334.9822 742.98 Tm /TT1.1 1 Tf
(/012\)$\)3%&)
Tj 0 Tc ET BT 28.02 0 0 28.02 459.5423 742.98 Tm /G2 1 Tf <0001> Tj ET BT
-0.0001 Tc 28.02 0 0 28.02 465.9028 742.98 Tm /TT1.1 1 Tf (42.*1+) Tj 0 Tc
ET 0 sc BT -0.0004 Tc 28.02 0 0 28.02 67.98239 744.1204 Tm /TT1.1 1 Tf
(!"#$%&'\(%\)*+)
Tj 0 Tc ET BT 28.02 0 0 28.02 210.3632 744.1204 Tm /G2 1 Tf <0001> Tj ET BT
-0.0001 Tc 28.02 0 0 28.02 216.7238 744.1204 Tm /TT1.1 1 Tf (\),) Tj 0 Tc
ET BT 28.02 0 0 28.02 240.6024 744.1204 Tm /G2 1 Tf <0001> Tj ET BT 0.0005
Tc 28.02 0 0 28.02 246.9629 744.1204 Tm /TT1.1 1 Tf (-.&.*\() Tj 0 Tc ET BT
28.02 0 0 28.02 327.484 744.1204 Tm /G2 1 Tf <0001> Tj ET BT -0.0011 Tc
28.02 0 0 28.02 333.8445 744.1204
Tm /TT1.1 1 Tf (/012\)$\)3%&) Tj 0 Tc ET BT 28.02 0 0 28.02 458.4047
744.1204
Tm /G2 1 Tf <0001> Tj ET BT -0.0001 Tc 28.02 0 0 28.02 464.7652 744.1204 Tm
/TT1.1 1 Tf (42.*1+) Tj 0 Tc ET 0.754 sc BT -0.0023 Tc 28.02 0 0 28.02
112.5062 709.3196
Tm /TT1.1 1 Tf (\)*) Tj 0 Tc ET BT 28.02 0 0 28.02 142.566 709.3196 Tm /G2
1 Tf <0001> Tj ET BT -0.0006 Tc 28.02 0 0 28.02 148.9266 709.3196 Tm /TT1.1
1 Tf (\(5.) Tj 0 Tc ET BT 28.02 0 0 28.02 187.7455 709.3196 Tm /G2 1 Tf
<0001>
Tj ET BT 0.0018 Tc 28.02 0 0 28.02 194.106 709.3196 Tm /TT1.1 1 Tf (6',.)
Tj 0 Tc ET BT 28.02 0 0 28.02 244.2646 709.3196 Tm /G2 1 Tf <0001> Tj ET BT
-0.0001 Tc 28.02 0 0 28.02 250.6252 709.3196 Tm /TT1.1 1 Tf (7%.$1) Tj 0 Tc
ET BT 28.02 0 0 28.02 308.1054 709.3196 Tm /G2 1 Tf <0001> Tj ET BT -0.0003
Tc 28.02 0 0 28.02 314.4043 709.3196 Tm /TT1.1 1 Tf (8+\(%"'\(.) Tj 0 Tc ET
BT 28.02 0 0 28.02 416.2234 709.3196 Tm /G2 1 Tf <0001> Tj ET BT 0.0002 Tc
28.02 0 0 28.02 422.5839 709.3196 Tm /TT1.1 1 Tf (,\)2) Tj 0 Tc ET BT
28.02 0 0 28.02 456.4853 709.3196
Tm /G2 1 Tf <0001> Tj ET BT 0.0006 Tc 28.02 0 0 28.02 462.8458 709.3196 Tm
/TT1.1 1 Tf (\(5.) Tj 0 Tc ET BT 28.02 0 0 28.02 501.7852 709.3196 Tm /G2
1 Tf <0001> Tj ET 0 sc BT -0.0023 Tc 28.02 0 0 28.02 111.3658 710.46 Tm
/TT1.1
1 Tf (\)*) Tj 0 Tc ET BT 28.02 0 0 28.02 141.4256 710.46 Tm /G2 1 Tf <0001>
Tj ET BT -0.0006 Tc 28.02 0 0 28.02 147.7861 710.46 Tm /TT1.1 1 Tf (\(5.)
Tj 0 Tc ET BT 28.02 0 0 28.02 186.6051 710.46 Tm /G2 1 Tf <0001> Tj ET BT
0.0018 Tc 28.02 0 0 28.02 192.9656 710.46 Tm /TT1.1 1 Tf (6',.) Tj 0 Tc ET
BT 28.02 0 0 28.02 243.127 710.46 Tm /G2 1 Tf <0001> Tj ET BT -0.0001 Tc
28.02 0 0 28.02 249.4875 710.46
Tm /TT1.1 1 Tf (7%.$1) Tj 0 Tc ET BT 28.02 0 0 28.02 306.9678 710.46 Tm /G2
1 Tf <0001> Tj ET BT -0.0003 Tc 28.02 0 0 28.02 313.2667 710.46 Tm /TT1.1
1 Tf (8+\(%"'\(.) Tj 0 Tc ET BT 28.02 0 0 28.02 415.0858 710.46 Tm /G2 1 Tf
<0001> Tj ET BT 0.0002 Tc 28.02 0 0 28.02 421.4463 710.46 Tm /TT1.1 1 Tf
(,\)2)
Tj 0 Tc ET BT 28.02 0 0 28.02 455.3477 710.46 Tm /G2 1 Tf <0001> Tj ET BT
0.0006 Tc 28.02 0 0 28.02 461.7082 710.46 Tm /TT1.1 1 Tf (\(5.) Tj 0 Tc ET
BT 28.02 0 0 28.02 500.6476 710.46 Tm /G2 1 Tf <0001> Tj ET 0.754 sc BT
0.0004
Tc 28.02 0 0 28.02 209.7076 675.7208 Tm /TT1.1 1 Tf (9&:$';'5') Tj 0 Tc ET
BT 28.02 0 0 28.02 338.2269 675.7208 Tm /G2 1 Tf <0001> Tj ET BT -0.0004 Tc
28.02 0 0 28.02 344.5874 675.7208 Tm /TT1.1 1 Tf (-%<.2) Tj 0 Tc ET 0 sc BT
0.0004 Tc 28.02 0 0 28.02 208.5671 676.8612 Tm /TT1.1 1 Tf (9&:$';'5') Tj
0 Tc ET BT 28.02 0 0 28.02 337.0865 676.8612 Tm /G2 1 Tf <0001> Tj ET BT
-0.0004
Tc 28.02 0 0 28.02 343.447 676.8612 Tm /TT1.1 1 Tf (-%<.2) Tj 0 Tc ET 1 sc
BT 0.0008 Tc 13.98 0 0 13.98 393.12 29.1 Tm /TT3.0 1 Tf [ (SDI ) -1
(Environmental )
-1 (Services, ) -1 (Inc. ) ] TJ 0 Tc ET q 97.44208 0 0 65.99999 44.99999
51.48006
cm /Im5 Do Q BT 0.0001 Tc 13.98 0 0 13.98 45.18 28.86 Tm /TT3.0 1 Tf [
(Putnam )
1 (County ) 1 (Environmental ) 1 (Council, ) 1 (Inc.) -3 ( ) ] TJ 0 Tc ET
BT -0.0007 Tc 16.02 0 0 16.02 274.56 119.58 Tm /TT3.0 1 Tf [ (June ) 2
(2010)
] TJ 0 Tc ET q 103.7244 0 0 64.49999 488.16 52.98005 cm /Im6 Do Q Q
This is the result when iText parses this syntax for text:
Implications nullof nullRecent nullHydrologic nullTrends
Implications nullof nullRecent nullHydrologic nullTrends
on nullthe nullSafe nullYield nullEstimate nullfor nullthe null
on nullthe nullSafe nullYield nullEstimate nullfor nullthe null
Ocklawaha nullRiver
Ocklawaha nullRiver
June 2010
SDI Environmental Services, Inc.
Putnam County Environmental Council, Inc.
Why is some of the text duplicated?
Because the text occurs twice in the PDF syntax.
For instance: (!"#$%&'\(%\)*+) stands for "Implications".
It occurs in two places:
coordinate 69.12, 742.98; and
coordinate 67.98239, 744.1204.
Because the two separate instances of the word are so close to each
other, you see it only once in the PDF, but that doesn't mean it isn't
there twice.
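
If you only want each word once, one option is to drop chunks that repeat the
same text at almost the same position. The following is a minimal sketch under
stated assumptions: the Chunk type, the method name, and the tolerance of 2
units are hypothetical and are not part of the iText API. In practice the
chunks would have to come from your own extraction code.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateChunkFilterSketch {

    /** A piece of extracted text plus the origin of its text matrix (hypothetical type). */
    public static final class Chunk {
        public final String text;
        public final double x;
        public final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps a chunk only if no previously kept chunk has the same text
     * within `tolerance` units on both axes.
     */
    public static List<Chunk> dropNearDuplicates(List<Chunk> chunks, double tolerance) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk candidate : chunks) {
            boolean duplicate = false;
            for (Chunk k : kept) {
                if (k.text.equals(candidate.text)
                        && Math.abs(k.x - candidate.x) <= tolerance
                        && Math.abs(k.y - candidate.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Implications", 69.12, 742.98));      // first instance
        chunks.add(new Chunk("Implications", 67.98239, 744.1204)); // second instance, about 1.14 units away
        chunks.add(new Chunk("June 2010", 274.56, 119.58));

        // prints "Implications" and "June 2010", one per line
        for (Chunk c : dropNearDuplicates(chunks, 2.0)) {
            System.out.println(c.text);
        }
    }
}
```

The tolerance has to be larger than the offset between the two instances
(roughly 1.14 units on each axis here) but small enough that genuinely
repeated words elsewhere on the page are kept.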
As for the 'null': an odd string, <0001>, drawn with font /G2, separates
the words in those first lines.
Hope this helps.
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a
reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples:
http://itextpdf.com/themes/keywords.php