Re: [iText-questions] Comments on com.lowagie.text.pdf.parser addition to SVN

Nathan ALDRIDGE Thu, 06 Nov 2008 14:30:24 -0800

I'm not sure what your development environment is, but in Eclipse you can set 
the project's compiler compliance level so that only JDK 1.4 compatible code 
compiles (I'm not entirely sure that will trap everything, though).

________________________________

From: Kevin Day [mailto:[EMAIL PROTECTED] 
Sent: November 6, 2008 13:33
To: [EMAIL PROTECTED]
Subject: Re: [iText-questions] Comments on com.lowagie.text.pdf.parser addition 
to SVN

Nathan-

Good catch.  When I wrote the code, I didn't realize there was a 1.4 JDK 
requirement.  Bruno stripped out all of the generics stuff, but the hashCode 
call and the StringBuilder are definitely issues, and there are suitable 1.4 
compatible mechanisms.
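
For reference, a minimal sketch of what 1.4 compatible replacements might look 
like. This is not the code that was committed; the class and method names are 
made up for illustration. It assumes the goal is to reproduce the result of 
Arrays.hashCode(float[]) by hand and to use StringBuffer where StringBuilder 
was used.

```java
// Hypothetical helper class; the names here are illustrative only.
class Jdk14Compat {

    // Hand-rolled stand-in for Arrays.hashCode(float[]), which was added in Java 5.
    // Uses the same 31-based accumulation over Float.floatToIntBits of each element.
    static int hashCode(float[] values) {
        if (values == null) {
            return 0;
        }
        int result = 1;
        for (int i = 0; i < values.length; i++) {
            result = 31 * result + Float.floatToIntBits(values[i]);
        }
        return result;
    }

    // StringBuffer exists in 1.4 and offers the same append/toString calls
    // as StringBuilder (with synchronization overhead).
    static String concat(String[] parts) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < parts.length; i++) {
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```

On a Java 5+ JVM the hand-rolled hash should agree with Arrays.hashCode for the 
same array, which makes it easy to sanity check without a 1.4 JVM at hand.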

I've got those corrected, but SVN is giving me a hard time about committing the 
changes.  I'm getting a 403 Forbidden in response to the MKACTIVITY request 
(maybe my SVN permissions aren't set up properly?).  As soon as SVN lets me 
commit, I'm pretty sure we'll be back to 1.4 compliance (I don't even have a 
1.4 JVM on my system so I can't test - sorry!).

- K

----------------------- Original Message -----------------------

From: "Nathan ALDRIDGE" <[EMAIL PROTECTED]>
To: "Post all your questions about iText here" <[email protected]>
Cc: 
Date: Thu, 6 Nov 2008 12:49:03 -0800
Subject: Re: [iText-questions] Comments on com.lowagie.text.pdf.parser addition 
   to SVN

I understand that this is very bleeding edge, but what is the policy on which 
JDK iText is targeted at?  Some JDK 5.0 APIs (Arrays.hashCode(float[]) and 
StringBuilder) have snuck in here that are definitely not JDK 1.4.

(also, thanks for doing something that has been in my 'get around to it' pile 
for a while now)

Regards,

Nathan

-----Original Message-----
From: Paulo Soares [mailto:[EMAIL PROTECTED]]
Sent: November 6, 2008 10:17
To: Post all your questions about iText here
Subject: Re: [iText-questions] Comments on com.lowagie.text.pdf.parser addition 
to SVN

Never mind the limitations; what's important is to have something to work on, 
and from what I saw there's already a lot of functionality available. A big 
thank you.

Paulo

________________________________________
From: Kevin Day [EMAIL PROTECTED]
Sent: Thursday, November 06, 2008 4:40 PM
To: itext-questions-request@
Subject: [iText-questions] Comments on com.lowagie.text.pdf.parser addition 
to SVN

In the interest of self-criticism, I thought I'd post some of the things that I 
think need work in the new content parser implementation:

1.  The CMapAwareDocumentFont, I suspect, is an ugly, ugly hack that would be 
better handled in a different way.  It seems to me that almost all fonts will 
have CMap type behavior - the ToUnicode tag is just one of those.  Perhaps I 
should have named the class ToUnicodeAwareDocumentFont - but I think that 
ultimately the correct solution here is going to be to build a CMap equivalent 
for any font object, regardless of whether ToUnicode is being used, or if the 
font's internal CMap is being used.  I wonder if incorporating the actual CMap 
object into the fonts might be the way to approach this.

What makes this a bit tricky is that the iText font class implementations are 
really geared more towards PDF generation than PDF parsing (completely 
expected) - I don't particularly want to introduce code to those classes just 
to support parsing in the small # of scenarios where it will be occurring - but 
it seems like the right approach...

Along these lines, I know for certain that the current extractor does not 
properly handle PDFs generated by MS Word that use forward and backwards ticks 
(font is a TrueType TimesNewRomanPSMT, with WinAnsiEncoding encoding).  There 
is no ToUnicode map for these fonts, but the Unicode that is used in the 
encoding doesn't render out to the actual tick marks (I get a trademark (TM) 
symbol for the tick, for example).  I suspect that the problem is that I am 
just blindly assuming that the encoding is standard Unicode if no ToUnicode map 
is specified.

2.  Spatial analysis is fairly limited right now.  For example, if content 
appears early on the page, but later in the content stream, the order of the 
extracted text is not consistent with the visual representation.  For our 
particular use of the parser, this is not an issue - but I could see where it 
might be important.  Fixing this would be relatively easy - tag each output 
line with its Y position, then order the array before converting to a string.

3.  Vertical orientation of text is not handled at all.  At this point, I'm not 
entirely sure how to even detect that a content stream is performing vertical 
rendering (maybe this is part of the font metrics??)

4.  Content included from external objects may not be handled properly.  The 
canonical example here is adding Page X of Y to the bottom of each PDF page.  
The value for 'Y' is added as an external XObject.  I have done no testing with 
this, but it's quite possible that the reconstruction of the phrase 'Page 3 of 
7' might not work properly here.  We might get 'Page 3' in one place in the 
text, then '7' in another.  This comes back to spatial analysis.

5.  The algorithm for determining word separation is not as robust as it should 
be.  For example, if the font doesn't specify a width for character 32, the 
algorithm fails entirely.  Also, is dividing the char 32 width by 2 
appropriate?  And what character/word spacing adjustments should really be made 
to that width?

6.  Is the overall architecture of the parser appropriate?  Specifically, is 
passing in the ending text matrix to displayText() the best way to achieve the 
goal of detecting whether the next string is part of the previous string or not?

7.  Is the use of floats appropriate, or should we be using int (or long) and 
scale everything by 1000?  I used float primarily b/c I was concerned about 
overflow of the Matrix entries - but the current implementation is certainly 
slower than it could be.

8.  Are there any gross errors being made in reading objects from the 
PdfReader?  For example, have I made any mistakes in terms of loading the 
content stream in PdfTextExtractor#getContentBytesForPage()?  How about how I 
read the resource dictionary in PdfTextExtractor#getTextFromPage() - should I 
be doing anything to ensure that these resources don't consume memory after the 
page has been processed?

9.  How should unit testing be configured for this functionality?  It seems 
like we will wind up needing to tune some of these algorithms as users find 
situations where the text parsing doesn't work properly, and I think it's 
important to ensure that this does not break things.  I'm currently thinking 
something along the lines of having a test documents folder that contains a PDF 
and a .txt file containing the extracted results.  The test would then go 
through and ensure that the extraction matches.  Not sure how best to handle 
multiple pages with this.  Maybe a separate .txt file for each page?
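
On point 2, the "tag each line with its Y position, then sort" idea could be 
sketched roughly as below. PositionedLine and LineSorter are hypothetical names, 
not classes in the parser package. The sketch assumes default PDF user space 
(origin at the bottom left, Y increasing upwards), so larger Y means higher on 
the page; rotated or transformed pages would need more care. It uses generics 
for brevity, so it is not 1.4 compatible as written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder: one extracted line plus the Y coordinate it was drawn at.
class PositionedLine {
    final float y;
    final String text;

    PositionedLine(float y, String text) {
        this.y = y;
        this.text = text;
    }
}

class LineSorter {
    // Orders lines top to bottom (descending Y) and joins them with newlines.
    // Collections.sort is stable, so lines sharing a Y keep content stream order.
    static String toText(List<PositionedLine> lines) {
        List<PositionedLine> sorted = new ArrayList<PositionedLine>(lines);
        Collections.sort(sorted, new Comparator<PositionedLine>() {
            public int compare(PositionedLine a, PositionedLine b) {
                return Float.compare(b.y, a.y);
            }
        });
        StringBuilder sb = new StringBuilder();
        for (PositionedLine line : sorted) {
            sb.append(line.text).append('\n');
        }
        return sb.toString();
    }
}
```

Sorting on Y alone does nothing for multi-column layouts or for text that 
shares a baseline but arrives out of order in X, so this only covers the 
simple case described in point 2.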
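
On point 5, one way to keep the word separation check from failing when the 
font reports no width for character 32 is to fall back to some other estimate. 
The sketch below assumes the check compares the horizontal gap between two 
strings against half the space width, as the question in point 5 implies; 
WordGapDetector and its parameters are hypothetical, and the choice of 
fallback value is left to the caller because nothing in this thread says what 
it should be.

```java
// Hypothetical helper; not part of the parser package.
class WordGapDetector {

    // gap                - distance between the end of the previous string and the
    //                      start of the next one
    // spaceWidth         - width the font reports for character 32 in the same
    //                      units, or 0 if the font does not specify one
    // fallbackSpaceWidth - caller-supplied estimate used when spaceWidth is missing
    static boolean isWordBreak(float gap, float spaceWidth, float fallbackSpaceWidth) {
        float effectiveWidth = spaceWidth > 0f ? spaceWidth : fallbackSpaceWidth;
        // The divide-by-2 threshold mirrors the open question in point 5;
        // it is a tunable guess, not an established value.
        return gap > effectiveWidth / 2f;
    }
}
```

Character and word spacing adjustments are deliberately left out here, since 
how they should affect the threshold is exactly what point 5 is asking.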
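
On point 9, the "folder of PDFs plus expected .txt files" idea, with one .txt 
per page, might look something like this. Everything here is an assumption 
made for illustration: the PageTextSource interface stands in for whatever 
wraps the real extractor, and the name.pageN.txt naming scheme and UTF-8 
encoding are arbitrary choices. It targets a current JDK rather than 1.4.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical adapter around the real extractor, so the check itself has no
// dependency on the parser classes.
interface PageTextSource {
    int pageCount(File pdf) throws IOException;
    String textOfPage(File pdf, int page) throws IOException;
}

class ExtractionRegression {

    // For every name.pdf in dir, compares each page's extracted text with the
    // contents of name.pageN.txt. Returns one message per problem; an empty
    // list means every page matched.
    static List<String> check(File dir, PageTextSource source) throws IOException {
        List<String> failures = new ArrayList<String>();
        File[] files = dir.listFiles();
        if (files == null) {
            return failures;
        }
        for (File pdf : files) {
            String name = pdf.getName();
            if (!name.endsWith(".pdf")) {
                continue;
            }
            String base = name.substring(0, name.length() - ".pdf".length());
            int pages = source.pageCount(pdf);
            for (int page = 1; page <= pages; page++) {
                File expectedFile = new File(dir, base + ".page" + page + ".txt");
                if (!expectedFile.exists()) {
                    failures.add(name + " page " + page + ": no expected file");
                    continue;
                }
                String expected = new String(
                        Files.readAllBytes(expectedFile.toPath()), StandardCharsets.UTF_8);
                if (!expected.equals(source.textOfPage(pdf, page))) {
                    failures.add(name + " page " + page + ": text differs");
                }
            }
        }
        return failures;
    }
}
```

Returning a list of failures rather than stopping at the first mismatch makes 
it easier to see how widely an algorithm tweak changes the output across the 
test documents.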

Any and all comments/feedback are welcome.

- K

Disclaimer:

This message is destined exclusively to the intended receiver. It may contain 
confidential or legally protected information. The incorrect transmission of 
this message does not mean the loss of its confidentiality. If this message is 
received by mistake, please send it back to the sender and delete it from your 
system immediately. It is forbidden to any person who is not the intended 
receiver to use, distribute or copy any part of this message.

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
