On 20/09/2010 10:21, mp wrote:
> I attach a new pdf. I use your code:
> I get the same error.

OK, the err.pdf you sent is completely different from the initial document. We had a pointless discussion about the same subject on the list a couple of weeks ago.

Somebody said: iText can't parse this file.
We replied: no tool can parse this file.
Then the OP got angry, thinking we didn't want to help him,
although we did everything we could to explain what had happened.

If you read chapter 2 of the book (I hope you took the time to do
that before starting an adventure with iText, just as you take
driving lessons before you get behind the wheel of a car),
you'll find this:
Characters in a file are rendered on screen or on paper as glyphs. ISO-32000-1, section 9.2.1, states: "A character is an abstract symbol, whereas a glyph is a specific graphical rendering of a character. For example: The glyphs A, /A/, and *A* are renderings of the abstract ‘A’ character. Glyphs are organized into fonts. A font defines glyphs for a particular character set."

So a glyph on a page is not the same as a character.

Now let's skip to chapter 11:
"Glyphs in a simple font are selected using a single byte. Each glyph corresponds to a character that has a value from 0 to 255. The mapping between the characters and the glyphs is called the character encoding."

If you have a language with more than 256 different glyphs, and you want to use a font as a "simple font", it goes without saying that you'll need a special encoding. The character A won't necessarily be mapped to a glyph that looks like an A.

This is explained in chapter 15:
"It’s possible for a PDF to have a font containing characters that appear in a content stream as a, b, c, and so on, but for which the shapes drawn in the PDF file show a completely different glyph, such as α, β, γ, and so on. An application can create a different encoding for each specific PDF document—for example, in an attempt to obfuscate. More likely, the PDF-generating software does this deliberately, such as when a font with many characters is used but all the text can be shown using only 256 different glyphs. In this case, the software picks character names at random according to the glyphs that are used."
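
To make that concrete, here is a minimal sketch in plain Java (no iText involved). The class name and the three-entry table are made up for illustration; they only mimic the a, b, c versus alpha, beta, gamma situation described in the quote, assuming the document carries no usable mapping back to Unicode.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomEncodingDemo {

    /** Hypothetical per-document encoding: character code -> glyph that is drawn. */
    static final Map<Character, String> GLYPHS = new HashMap<>();
    static {
        GLYPHS.put('a', "\u03b1"); // Greek small letter alpha
        GLYPHS.put('b', "\u03b2"); // Greek small letter beta
        GLYPHS.put('c', "\u03b3"); // Greek small letter gamma
    }

    /** What a viewer shows: every code is looked up in the font and its glyph is drawn. */
    static String render(String codes) {
        StringBuilder sb = new StringBuilder();
        for (char c : codes.toCharArray()) {
            sb.append(GLYPHS.getOrDefault(c, "?"));
        }
        return sb.toString();
    }

    /** What a parser gets when it only has the content stream: the raw codes. */
    static String extract(String codes) {
        return codes;
    }

    public static void main(String[] args) {
        String contentStream = "abc";
        System.out.println("rendered:  " + render(contentStream));
        System.out.println("extracted: " + extract(contentStream));
    }
}
```

The reader of the page sees the Greek letters; a parser that only has the codes gets "abc".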

Now if you run the attached example, you'll get the following result when parsing err.pdf:

<<1 ><1 ><2 ><3 ><4 5 ><6 ><2 ><7 ><5 ><8 8 9 ><6 ><a ><5 ><7 ><b ><7 ><5 ><8 8 9 ><6 ><1 ><2 ><3 ><4 ><5 ><6 ><2 ><7 ><8 ><5 ><4 ><9 ><4 ><5 ><2 ><2 ><a ><2 ><2 ><1 ><8 ><b ><2 ><8 ><a ><b ><c ><4 d ><3 ><6 ><e ><f 4 d ><b ><2 ><4 5 ><8 >> and so on...

What do you see?
The software that created your PDF used (char) 1 for the first glyph that was added, (char) 2 for the second glyph, (char) 3 for the third, and so on.
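
I don't know which tool produced err.pdf, so the following is only a sketch of what such a numbering scheme could look like, not that tool's actual code. The class and method names are invented for the illustration. Codes are handed out in order of first appearance and printed in hex, the same way the attached example prints them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SequentialCodesDemo {

    /**
     * Gives code 1 to the first distinct character, 2 to the second, and so on,
     * and returns the resulting codes as space-separated hex values.
     */
    static String encode(String text) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            Integer code = codes.get(c);
            if (code == null) {
                // First time we meet this character: hand out the next free code.
                code = codes.size() + 1;
                codes.put(c, code);
            }
            sb.append(Integer.toString(code, 16)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(encode("hello world")); // 1 2 3 3 4 5 6 4 7 3 8
        System.out.println(encode("abba"));        // 1 2 2 1
        System.out.println(encode("noon"));        // 1 2 2 1
    }
}
```

Note that "abba" and "noon" produce exactly the same codes. Once the original characters are thrown away, the numbers alone can't tell you which text was meant.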

There is no way for iText to know what the glyph corresponding to (char) 1 looks like. I mean: iText can find the paths that were used to draw the glyph (two concentric circles could be an O, two circles on top of each other could be an 8, and so on).

But iText doesn't do OCR, nor does any other F/OSS PDF library that I know of.
iText makes a good effort to parse PDF documents, and if you take the time to get your driver's license, I mean: if you take the time to read the book before asking questions, you'll fully understand that some PDF files just can't be parsed.

import java.io.IOException;
import java.io.PrintStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class PdfParse {

        class MyTextRenderListener implements RenderListener {

                /** The print stream to which the information will be written. */
                protected PrintStream out;

                /**
                 * Creates a RenderListener that will look for text.
                 */
                public MyTextRenderListener(PrintStream printStream) {
                        this.out = printStream;
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
                 */
                public void beginTextBlock() {
                        out.print("<");
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
                 */
                public void endTextBlock() {
                        out.println(">");
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
                 */
                public void renderImage(ImageRenderInfo renderInfo) {
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#renderText(com.itextpdf.text.pdf.parser.TextRenderInfo)
                 */
                public void renderText(TextRenderInfo renderInfo) {
                        out.print("<");
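                        // Uncomment to print the start point of each text chunk as well:
                        /*out.print('(');
                        out.print(renderInfo.getBaseline().getStartPoint().get(Vector.I1));
                        out.print(',');
                        out.print(renderInfo.getBaseline().getStartPoint().get(Vector.I2));
                        out.print(')');
                        out.print(' ');*/
                        // Print every character code of the chunk in hex.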
                        String text = renderInfo.getText();
                        for (int i = 0; i < text.length(); i++) {
                                out.print(Integer.toString(text.charAt(i), 16));
                                out.print(' ');
                        }
                        out.print(">");
                }
        }

        PdfParse() {
                PdfReader reader;
                try {
                        reader = new PdfReader("err.pdf");
                        RenderListener listener = new MyTextRenderListener(System.out);
                        PdfContentStreamProcessor processor =
                                        new PdfContentStreamProcessor(listener);
                        PdfDictionary pageDic = reader.getPageN(1);
                        PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
                        processor.processContent(
                                        ContentByteUtils.getContentBytesForPage(reader, 1),
                                        resourcesDic);
                } catch (IOException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                }
        }

        public static void main(String[] args) {
                new PdfParse();
        }
}