Re: [iText-questions] invalid strings when doing textextract.

1T3XT info Tue, 21 Sep 2010 04:27:59 -0700

On 20/09/2010 10:21, mp wrote:

I attach a new pdf. I use your code:
I get the same error.

OK, the err.pdf you've sent is completely different from the initial document. We had a pointless discussion about the same subject on the list a couple of weeks ago.


Somebody said: iText can't parse this file.
We replied: no tool can parse this file.
Then the OP got angry thinking we didn't want to help him,
although we tried to explain with hands and feet what had happened.

If you read chapter 2 of the book (I hope you took the time to do
that before starting an adventure with iText, just as you take
driving lessons before you get behind the wheel of a car),
you will have read:

Characters in a file are rendered on screen or on paper as glyphs. ISO-32000-1, section 9.2.1, states: "A character is an abstract symbol, whereas a glyph is a specific graphical rendering of a character. For example: The glyphs A, /A/, and *A* are renderings of the abstract ‘A’ character. Glyphs are organized into fonts. A font defines glyphs for a particular character set."


So a glyph on a page is not the same as a character.

Now let's skip to chapter 11:

"Glyphs in a simple font are selected using a single byte. Each glyph corresponds to a character that has a value from 0 to 255. The mapping between the characters and the glyphs is called the character encoding."

If you have a language with more than 256 different glyphs, and you want to use a font as a "simple font", it goes without saying that you'll need a special encoding. The character A won't necessarily be mapped to a glyph that looks like an A.
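
To make that concrete, here is a minimal sketch in plain Java (no iText involved) of how the same single-byte codes can select different glyphs depending on the encoding. The class name CustomEncodingDemo, the method decode and both tables are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration in plain Java; none of this is iText API.
final class CustomEncodingDemo {

    /** Looks up every single-byte code in the given encoding table. */
    static String decode(byte[] codes, Map<Integer, Character> encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // a simple font addresses glyphs with values 0..255
            Character glyph = encoding.get(code);
            sb.append(glyph != null ? glyph.charValue() : '?'); // '?' marks an unmapped code
        }
        return sb.toString();
    }

    /** An encoding where the codes 0x41..0x43 show the glyphs A, B, C. */
    static Map<Integer, Character> latin() {
        Map<Integer, Character> m = new HashMap<>();
        m.put(0x41, 'A');
        m.put(0x42, 'B');
        m.put(0x43, 'C');
        return m;
    }

    /** A made-up special encoding where the same codes show Greek glyphs. */
    static Map<Integer, Character> greek() {
        Map<Integer, Character> m = new HashMap<>();
        m.put(0x41, '\u03b1'); // alpha
        m.put(0x42, '\u03b2'); // beta
        m.put(0x43, '\u03b3'); // gamma
        return m;
    }

    public static void main(String[] args) {
        byte[] codes = "ABC".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(codes, latin()));
        System.out.println(decode(codes, greek()));
    }
}
```

The bytes are identical in both calls; only the encoding table decides which glyphs they stand for.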


This is explained in chapter 15:

"It’s possible for a PDF to have a font containing characters that appear in a content stream as a, b, c, and so on, but for which the shapes drawn in the PDF file show a completely different glyph, such as α, β, γ, and so on. An application can create a different encoding for each specific PDF document—for example, in an attempt to obfuscate. More likely, the PDF-generating software does this deliberately, such as when a font with many characters is used but all the text can be shown using only 256 different glyphs. In this case, the software picks character names at random according to the glyphs that are used."

Now if you use the attached example, you'll get the following result when parsing err.pdf:

<<1 ><1 ><2 ><3 ><4 5 ><6 ><2 ><7 ><5 ><8 8 9 ><6 ><a ><5 ><7 ><b ><7><5 ><8 8 9 ><6 ><1 ><2 ><3 ><4 ><5 ><6 ><2 ><7 ><8 ><5 ><4 ><9 ><4 ><5><2 ><2 ><a ><2 ><2 ><1 ><8 ><b ><2 ><8 ><a ><b ><c ><4 d ><3 ><6 ><e><f 4 d ><b ><2 ><4 5 ><8 >> and so on...


What do you see?

The software that created your PDF used (char) 1 for the first glyph that was added, (char) 2 for the second glyph, (char) 3 for the third, and so on...
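
That numbering pattern can be reproduced with a few lines of plain Java. This is only a sketch, assuming the generator hands out codes in order of first appearance; the class SequentialCodeDemo and its method encode are made up for this illustration and are not part of iText or of the tool that produced err.pdf.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: number the glyphs in the order they are first used.
final class SequentialCodeDemo {

    /** Returns the hex codes a "first come, first numbered" generator would write. */
    static String encode(String text) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char glyph = text.charAt(i);
            Integer code = codes.get(glyph);
            if (code == null) {
                code = codes.size() + 1; // first new glyph gets 1, the next one 2, ...
                codes.put(glyph, code);
            }
            // Same hex formatting as the listener in the attached example.
            sb.append(Integer.toString(code, 16)).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("hello world"));
    }
}
```

For "hello world" this prints 1 2 3 3 4 5 6 4 7 3 8. The codes keep the repetition pattern of the text, but nothing in them says which letters were on the page.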

There is no way for iText to know what the glyph corresponding to (char) 1 looks like. I mean: iText can find the paths that were used to draw the glyph (two concentric circles could be an O, two circles on top of each other could be an 8), and so on...
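
If you want your own code to notice this situation instead of passing garbage along, one possible heuristic is to check how much of the extracted string consists of control characters. The sketch below is plain Java; the class OpaqueTextCheck, the method looksOpaque and the 50% threshold are assumptions made for this example, not an iText feature.

```java
// Hypothetical helper: flags strings that are mostly control codes.
final class OpaqueTextCheck {

    /** Returns true when more than half of the characters are control codes. */
    static boolean looksOpaque(String text) {
        if (text.isEmpty()) {
            return false; // nothing to judge
        }
        int control = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Tab, line feed and carriage return are legitimate in real text.
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                control++;
            }
        }
        return control * 2 > text.length(); // assumed threshold: 50%
    }

    public static void main(String[] args) {
        System.out.println(looksOpaque("\u0001\u0002\u0003\u0004")); // codes like those in err.pdf
        System.out.println(looksOpaque("Hello world"));
    }
}
```

You could call it on the String returned by renderInfo.getText() in the attached example. It only catches the sequential numbering seen in err.pdf; a font that remaps printable codes (a shown as α, for instance) would pass unnoticed.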


But iText doesn't do OCR, nor does any other F/OSS project.

iText makes a good effort to parse PDF documents, and if you take the time to get your driver's license, I mean: if you take the time to read the book before asking questions, you will fully understand that some PDF files just can't be parsed.

import java.io.IOException;
import java.io.PrintStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class PdfParse {

        class MyTextRenderListener implements RenderListener {

                /** The print stream to which the information will be written. */
                protected PrintStream out;

                /**
                 * Creates a RenderListener that will look for text.
                 */
                public MyTextRenderListener(PrintStream printStream) {
                        this.out = printStream;
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#beginTextBlock()
                 */
                public void beginTextBlock() {
                        out.print("<");
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#endTextBlock()
                 */
                public void endTextBlock() {
                        out.println(">");
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(com.itextpdf.text.pdf.parser.ImageRenderInfo)
                 */
                public void renderImage(ImageRenderInfo renderInfo) {
                }

                /**
                 * @see com.itextpdf.text.pdf.parser.RenderListener#renderText(com.itextpdf.text.pdf.parser.TextRenderInfo)
                 */
                public void renderText(TextRenderInfo renderInfo) {
                        out.print("<");
                        /*out.print('(');
                        out.print(renderInfo.getBaseline().getStartPoint().get(Vector.I1));
                        out.print(',');
                        out.print(renderInfo.getBaseline().getStartPoint().get(Vector.I2));
                        out.print(')');
                        out.print(' ');*/
                        // Print every character code in hexadecimal: the codes in err.pdf are not printable text.
                        String text = renderInfo.getText();
                        for (int i = 0; i < text.length(); i++) {
                                out.print(Integer.toString(text.charAt(i), 16));
                                out.print(' ');
                        }
                        out.print(">");
                }
        }

        PdfParse() {
                PdfReader reader;
                try {
                        reader = new PdfReader("err.pdf");
                        RenderListener listener = new MyTextRenderListener(System.out);
                        PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
                        PdfDictionary pageDic = reader.getPageN(1);
                        PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
                        processor.processContent(
                                        ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
                } catch (IOException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                }
        }

        public static void main(String[] args) {
                new PdfParse();
        }
}


